impacted your behavior on some level. He argued that this genetic factor was correlated with both smoking behavior and lung cancer, that this genetic factor was what was really causing lung cancer, and that it was creating a spurious association between smoking and lung cancer. My question to you is: how would you test this theory? How would you examine whether this genetic factor was truly causing lung cancer in smokers, as opposed to smoking itself? So with these two motivating questions in mind, here is our outline for today. We're going to discuss some fundamental concepts of causal inference; this is more of a philosophical discussion, and I know the afternoon is maybe not the best time to talk philosophy, but we'll be talking about philosophy. Then we're going to talk about how we visualize causal relationships and how we can create diagrams that allow us to show assumptions, and then test those assumptions, using DAGs, or directed acyclic graphs. I'll go through some conceptual areas of bias and confounding. We'll look at some common study designs and their implications for causal inference, and how they might affect how you think about causal inference in the studies you are conducting. And then, briefly, we'll go over some tools in R and how we can use these concepts in our research using R and Bioconductor.

So I'm going to start by throwing this question out to the room: in your own words, what is a cause? What does it mean to cause something? People can put this in chat or call it out. So, a factor without whose presence something else doesn't happen? Yeah, that's a great answer, and it's hitting a key point here, which is that it's necessary for something to happen. Okay, yeah, so there's an input and then some kind of result from that input that we can define, and I think what you're getting at is that there's a consistency to it as well: given a consistent set of inputs, we should expect a consistent set of outputs, more or less, right? Is there anything in the chat? Okay, any other thoughts? I think those are both great points, and I think this is hard to think about because intuitively we know what it means to cause something in our daily lives. I know that when I pick up this water bottle, I have caused it to be lifted by putting my hand on it and exerting force on it. But it becomes more difficult when we think about it in terms of research, when we're a little more abstracted, when we're thinking on the microbiome level or the human health level, and it becomes more complicated because we often can't observe causation directly. So I give an example here: when I burn my finger while cooking, I know what caused it; my finger came in contact with a hot stove. And I can think about a causal pathway that might have led me to come in contact with the hot stove. Maybe it's the fact that I was cooking dinner; the motivation to cook dinner and having the ingredients on hand were all necessary components to me burning my finger, as well as maybe being distracted by my cat. So we can think about this long causal pathway of all these variables, and it gets truly overwhelming at a certain point, which is why we need to start thinking about how we can define things very succinctly and very concretely as we switch over to health research and bioinformatics. So, just to start getting people thinking: what are some causal questions, either ones you're interested in personally or ones you think bioinformaticians as a field may be interested in, within the context of bioinformatics or your own field? I'll start with one. I gave the classic epi example of smoking and lung cancer, but a more modern example that is still really contentious in epidemiological studies is the role of vitamin D and whether it is protective in
any way, and whether giving people vitamin D supplements is actually helping them. There have been tons of studies back and forth, and no one is really quite sure about the role of vitamin D yet, so that's one question that is still very open in epidemiology. Does anyone have some examples? Yeah, that's a great one; we just saw this play out, where a lot of people were very adamant that the vaccine was not helpful, in fact that it was harmful. So there is a causal question there: does the vaccine protect people, or does it harm people? I think we're pretty sure it helps people overall, but some people are very convinced that the opposite is true. So how do we provide evidence that it is in fact helpful? That can be very difficult when we're in a fast-moving pandemic, with variants emerging, and running clinical trials takes a while. Any other examples? Okay, we'll move on.

So, because this seems very overwhelming, as I was mentioning, with so many different variables, and because even just defining what a cause is can be very difficult, what people talk about is the counterfactual definition of a cause. What this means is that we rely on inferences based on what we've observed to imagine what would have happened had the event of interest not occurred. To put this another way: imagine a world where everything is the same, except I didn't get distracted by my cat. Would I still have burned my finger? Or, to use another example, imagine you have a group of, let's say, one hundred participants. You assign all one hundred of them to smoke and see if they get lung cancer. Then you invent a time machine, travel back in time, go to the same one hundred participants and say, hey, none of you smoke now, and then you see how many of them get lung cancer in this alternate timeline you've created. Of course, we can't time travel and we can't create multiple dimensions; we're not Dr. Strange or Wanda or whoever from Marvel, so we can't do that, right? But we can maybe design a study that tries to mimic that. To get a more concrete example, this is one I think about a lot living in New York City: there's a lot of noise, a lot of ambulances driving by, and often you'll hear an ambulance drive by and then all the dogs start barking, right? So if we imagine an ambulance driving by and a dog starting to bark, we can ask the causal question in the normal way: did the ambulance driving by the house cause the dog to bark? But we can then rephrase that as the counterfactual question: would the dog have barked if the ambulance had not driven by the house? That's a little confusing, but you can just imagine: if that ambulance had not been there, would the dog have barked? That's a question we don't know the answer to without a time machine or multiverses, unless we can design a clever study that mimics one. Another example, vaccines: does the vaccine increase the risk of blood clots? We could rephrase that as a counterfactual: would people have developed blood clots if they had not been given the vaccine? That's a much more difficult question to answer, right? Because we're not sure; it could just be a fact of aging, or stress; we don't know. And I think I just did yours, because the next question is to take a causal question from bioinformatics or your field and reframe it in terms of a counterfactual, which is kind of what we just did. But does anyone have another example of reframing a question as a counterfactual? Yeah, that's a great example. If you take aspirin for a headache, you could ask: would the headache have gone away if I had not taken the aspirin? Did the aspirin actually help me? And we're not sure; maybe we have to think
about how to design a study to look at that. But I think this is a lot of words, and I know we like diagrams and plots, so we're going to switch over to thinking about how we can represent this in a diagram. There are graphs called DAGs; Professor Makaji mentioned one of these, I think, and if you're wondering what she meant, they are directed acyclic graphs: directed in that every relationship should have an arrow pointing one way, and acyclic because there should not be cycles in them; they should just go in one direction. I have a very simple example of one right here for our ambulance and dog barking example. This is the causal relationship that perhaps we're proposing: the ambulance causes the dog to bark, which is represented by the arrow. The diagram depicts the cause, or exposure (the ambulance), and the effect, or outcome (the dog barking), however you want to call them, and the arrow is the causal relationship and its direction. If this arrow were pointing the other way, we'd be proposing that the dog barking was causing the ambulance siren, which I don't think is very likely, though perhaps there's a situation where that is true. In contrast to a diagram showing statistical relationships, causal diagrams must state a directional relationship, because we are theorizing that one variable causes the other. We're not in the realm of correlation here, where we find a correlation between, say, murders and ice cream trucks and just think about that; no, we are saying one variable is causing the other. That's what we're theorizing. So this is the simplest form of a causal diagram: we have only two variables and a very simple relationship. If we wanted to study this relationship, we could make a very simple two-variable regression model, where our outcome or dependent variable is the dog's barking and our independent variable is the
ambulance driving by; we could do a study of it, I suppose, and collect data from around the city. But we have just two variables, and we are rarely, rarely, rarely that lucky to have such a simple relationship in a lot of our research, especially observational research. Can you think of any situations that may bias the observed relationship between the ambulance and the dog barking? I know this is tough; I'm asking you tough questions after lunch. Oh, do we have something? Okay, there's a question from Ryan Thompson: true causal relationships are not limited to being acyclic though, right? The acyclic assumption is an assumption we are making to simplify the math. Yeah, so the question is whether there could be a cycle in a causal relationship, and that's true: there are feedback loops, so there can be cycles, but directed acyclic graphs are by definition acyclic. There's a lot of math underpinning this that I'm not going to go into, but yes, that assumption does simplify the math, so that's a good point. And I saw someone else in chat put an explosion as their answer, and that's a great example. Maybe there's an explosion, maybe a car backfires, or maybe there's a car accident: something that causes the dog to bark and also causes the ambulance to be there. Maybe someone was in that accident and got hurt; the accident caused the dog to start barking, and the accident is why the ambulance is in the area. So there's a third factor, a confounder, of the relationship between the ambulance and the dog barking. Confounding is a type of bias, and we'll talk about it more in a moment when we talk about making DAGs, but confounding biases the observed relationship between the exposure and the outcome. You're shaking your head, do you disagree with that? Okay, yeah, that could be part of it too, yeah.
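To see how a confounder like the car crash distorts an observed association, here is a minimal simulation sketch in base R. The variable names and effect sizes are made up for illustration: the crash raises the probability of both an ambulance and barking, and there is no direct ambulance-to-barking effect, yet the crude model still shows an association that shrinks once we condition on the crash.

```r
set.seed(1)
n <- 100000
crash     <- rbinom(n, 1, 0.05)                 # the confounder
ambulance <- rbinom(n, 1, 0.02 + 0.70 * crash)  # crash -> ambulance
barking   <- rbinom(n, 1, 0.10 + 0.40 * crash)  # crash -> barking (no direct
                                                #   ambulance effect at all)

# Crude model: the open backdoor path through the crash makes the
# ambulance look like it causes barking (clearly positive coefficient)
coef(glm(barking ~ ambulance, family = binomial))["ambulance"]

# Adjusted model: conditioning on the crash closes the backdoor path,
# so the ambulance coefficient shrinks toward zero
coef(glm(barking ~ ambulance + crash, family = binomial))["ambulance"]
```

This is the same "deconfounding by adjustment" idea we'll return to later: including the confounder as a covariate in the regression removes the spurious association.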
Could be, you know; let me think about that. Where you live, that could almost be like generalizability: are my findings from my ambulance and dog barking study, which I did in New York City, generalizable to a small town? That's an interesting way of thinking about it. But why don't we talk about making DAGs. We can make DAGs in R; we can also draw them by hand if you want, and I think very commonly DAGs are drawn on whiteboards and then you take a photo of them; that's what most commonly happens when I ask someone to help me with a DAG. But there are a lot of software options out there if you have very bad handwriting like I do, and you can even make them in R, so we'll look at a few of those options. The first option is drawing by hand, but the second is this great program at dagitty.net (that's d-a-g-i-t-t-y dot net), developed by a team at the University of Lübeck in Germany. You've got a few different options here; you can see already that there's an R package, but we'll look at the online tool real quick, so we'll launch that. We get this nice example that they've given us, but we'll make our own model, so we'll go up to Model, then New model, and we have a blank screen. Let's stick with smoking and lung cancer. To add a variable, we just click anywhere on the gray screen, and we can type "smoking"; that's a new variable we've added, and then we can add "lung cancer". So we've got our two variables, and to connect them we just click on smoking, then click on lung cancer, and we've created a causal relationship between these two variables. Furthermore, we can click on smoking and denote it as our exposure by checking the exposure box, and then click on lung cancer and check outcome to show that it's the outcome. The exposure is marked with this yellow triangle, like a play symbol, and the outcome is this blue circle with a line in it. If you make a mistake, like accidentally creating a double-headed arrow, you can just click on the two variables multiple times to turn it off, or, if you wanted the arrow to go the other way (this would be if lung cancer caused smoking), you just click on the two variables again and the arrow flips to the direction you want. So, to show you one more time: you just click on smoking, click on that oval, click on lung cancer, and the arrow should appear. Yep, good question; it's a little confusing because you think you need to draw the line or something, but you just click on the two variables. Now let's add what Fisher posited, which was a genetic factor. His argument was that there was some genetic factor, some third factor, that was both influencing people to smoke and causing lung cancer. So he proposed that this genetic factor caused people to smoke, so we'll put an arrow there, and that it caused lung cancer. We can see that this variable turned red; that's because it's confounding the relationship between smoking and lung cancer, and we'll talk about what a confounder means in a moment. The tool knows that's a problem, that we would need to adjust for it in some way; in fact, in the top right corner it tells us that if we want to estimate the total effect of smoking on lung cancer, we need to adjust for genetic factors. While we're here, are there other variables that you think are important to this relationship? Age, a very important one: imagine age both influences whether someone smokes and also influences whether they get lung cancer, so we can add that. Other variables? Sorry, yes, gender; we can imagine that probably
influences both as well. Can you think of any variables that might influence only one of these? Yeah, perfect, something environmental; let's call it pollution, that's pretty succinct, and I can spell pollution. Okay, social environment; we can call that SES, and I imagine that probably actually influences both of them. Oh yeah, and SES would be connected to pollution, that's a good point. Any other variables? That's a good question: what if your diagram is wrong? What do you think happens? What I would say is that you use this diagram to design your hypotheses and the tests that you're doing, so your assumption is that you have specified your DAG correctly; if you've specified it incorrectly, it could result in biased results. And as you are seeing, this is as much an art as it is a science, because I'm sure if we got someone in here who is an expert in lung cancer, they could probably throw out 10 or 15 more variables that we haven't considered. So it is important to specify your DAG based on subject matter expertise and prior research; like a lot of things in research, the more information we can put in, the more useful the end product we get out. Again, this is still a pretty simple DAG, but if you're studying something more complicated, it can get pretty complicated. Okay, another question, because people are still confused about why you need this DAG when it may be erroneous. Roshika asks: when confounders are unknown, or not all of them are known, how do you ensure accuracy or account for that limitation? That is a limitation of almost any observational research: there might be confounders that are unmeasured, unknown, unspecified, or incorrectly measured, and that is a limitation.
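One quantitative way to reason about an unknown or unmeasured confounder is a sensitivity analysis. A simple, widely cited example is the E-value of VanderWeele and Ding (2017): the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need with both the exposure and the outcome to fully explain away an observed effect. Here is a minimal base-R sketch; the function name is mine, and it covers only the basic case of an observed risk ratio above 1.

```r
# E-value for an observed risk ratio rr > 1 (for rr < 1, invert it first):
# the minimum confounder-exposure and confounder-outcome risk ratio that
# could fully explain away the observed association.
e_value <- function(rr) {
  stopifnot(rr > 1)
  rr + sqrt(rr * (rr - 1))
}

# Smoking and lung cancer: with a roughly nine-fold observed risk, a
# hypothetical confounder like Fisher's genetic factor would need a
# risk ratio of about 17.5 with BOTH smoking and lung cancer.
e_value(9)  # about 17.5
```

A large E-value like this is one way to argue that an unmeasured confounder is an implausible explanation for an observed effect.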
That's something we have to be aware of. Something a DAG also helps us do is imagine variables that perhaps we didn't collect. This comes up a lot in studies I've worked on, where we realize, oh shoot, we didn't collect a good measure of some variable, and it might actually be important. Knowing that we don't have those data is a limitation, and it makes us aware, for future research, of how we might do better and include that variable. So yes, it does help you identify the limitations of what you're doing. And the point here is not to have the perfect DAG, because I don't think you ever will; there are so many little factors that influence lung cancer that even if we specified hundreds of variables, we still wouldn't be able to predict with perfect 100% accuracy whether someone would get lung cancer. The goal here isn't necessarily the perfect model; it's a model that we feel confident in, that we can then use to design our study, or, if the data are already collected, as is often the case, to specify the model variables correctly, and then to think about what we could collect in the future or how we could better design the study next time. So the goal is not perfection; it's thinking through the assumptions you are making, the variables you think affect this relationship, and how you can adjust for them. That's a good question. Any other questions? We'll get more into confounding and adjustment in a little bit.

So far we've been using the web interface, but perhaps we want to do this in R; we're at a Bioconductor conference, we want to do stuff in R, right? You'll notice down here there's this nice little code snippet, and we can select all of it and take it into R. There are two packages we can look at. The first is dagitty: dagitty itself has an R package, and we can say, okay, I want to specify a DAG with dagitty, put the DAG specification in single quotation marks, and then wrap it in plot(); hopefully that works, and there's the model we just made. The nice thing is that it even keeps the positions you set, so everything's ready to go; it is now in R and you can play with it. You can see there are these coordinates, so if you really wanted to tweak it and make it perfect, you can definitely do that. This package, dagitty, is relatively simple and uses its own way of thinking about things, so someone said, well, dagitty is good, but what if we did something a little more tidy? That's where ggdag comes into play. It uses the same kind of syntax as ggplot2, tidymodels, and any other tidy package, where we specify our DAG using a more familiar R model syntax: the dependent variable on the left and the independent variables on the right, with a tilde separating them and plus signs between them. This is a slightly simpler DAG, with just smoking, lung cancer, and the genetic factor. We'll create that object, and we can then look at its tidy representation. You can see it has set its own x and y coordinates (it's a little better at deciding positioning, so you don't need to specify it), and it shows the arrows that are specified. I should make this a little bigger; I'm guessing this is really small. Is that better? Probably a little better. So our first variable here is genetic factor; we can see it has an arrow that leads to lung cancer, and then genetic
factor has an arrow that leads to smoking, and finally we've got smoking leading to lung cancer. Obviously it's kind of annoying to try to read a DAG that way, so instead we can use the ggdag() function, which also lets us use the normal ggplot2 commands: we can set themes, play with plots, overlay it on other plots; anything you can do with ggplot2 you can do with ggdag. Here we end up with something a little simpler than the one we were looking at (I didn't want to spend all the time recreating it), but we can see we've got smoking, lung cancer, and our genetic factor. There are some other cool things you can do once you're in ggdag. You can ask it to tell you what you need to adjust for in your model, similar to how dagitty was telling us, using the ggdag_adjustment_set() function. If we run that here, it tells us that if we want to find the effect of smoking on lung cancer, we need to adjust for genetic factor. It's also really easy to add another confounder if we think of one, say age: we can just add age in the dagify() call, which again follows the standard R model formula, and we think there's also an arrow from age to smoking, so we can add that real quick. Now we've added age, and if we look at the adjustment set, we now need to adjust for age. So: really straightforward, really easy, and a great way to make these visualizations.

I'm sure at this point you're saying, okay Chloe, we've been going for more than 35 minutes; why do I actually need to make these DAGs? What do they tell me? Well, what this DAG is telling us is what we need to adjust for: these two variables are confounders. What is a confounder? This term gets used a lot, and I was taught an incorrect definition of it when I was in my master's program. There is a mathematical definition, but I'm going to stick to the common one. A confounder is a third variable present in the relationship between a cause and an effect, and a confounder is present when the confounder causes the outcome, the confounder causes the exposure, and the confounder is not a mediator, that is to say, it's not in between the exposure and the outcome on the causal pathway. The loose definition I was taught was just "associated"; I wasn't told anything about causality, so I want to be clear that these relationships need to be causal. And often people use data-driven approaches to identify confounders, using the same data they are analyzing for a causal relationship, and that's considered bad practice; it's supposed to be theory-driven. We should build our DAG first, based on our knowledge, and then use it to identify confounders, which we then adjust for. So, returning to our example of the ambulance and the barking dog: as mentioned, there could be a confounder, the car crash. What is the confounder actually doing that is bad here? It's creating what is called a backdoor pathway. If we imagine the pathway of interest being the arrow that leads from ambulance to dog barking, what the confounder does is open an additional pathway, from ambulance through car crash to dog barking, and that creates bias. It could bias your estimate either toward or away from the null, depending on the exact nature of the confounder, but it will introduce bias, and the magnitude of that bias can vary; it could be very big or very small. In the example of smoking and lung cancer, one of the arguments against Fisher's genetic factor was that the genetic factor would
have had to account for something like a nine-fold increase in lung cancer among people who smoke, and at that point no genetic factor had been identified that was anywhere close to increasing the odds of lung cancer that greatly. So, thinking about confounders real quick, I just want to ask: what are some common confounders in bioinformatics, health research, or your own research? We've already talked about age and socioeconomic status, but can you think of some others? Yes, batch effects, that's a big one: if all the samples for the cases are processed at one lab because the cases are hospitalized, but the controls come from the community and are processed at a different lab, a different time, or a different temperature, that could introduce a confounding effect. Any other ones? Yeah, clinic sites, kind of: if you have different sites that differ, that could confound the relationship as well, if it's affecting the exposure and the outcome. Any other thoughts from chat? No? Okay.

So what do we do when we have confounders? I've mentioned adjustment, but the word I also really like is "deconfound", because there are multiple ways of dealing with confounders. The classical way was to stratify your data by levels of the confounder, calculate an effect size for each stratum, and then calculate a weighted average across strata. That sounds like a lot of work, doesn't it? You may have been exposed to this method in a biostatistics course as the Cochran-Mantel-Haenszel statistic. That was the old-fashioned way of doing it, back when people were doing statistical testing on older computers or even by hand, but now we can do regression, so we can stick with that. In a regression model, we simply include the confounders as independent variables in our model, and they are effectively controlled for, because when we include a confounder as a variable in a regression model, we are stratifying the effect of our exposure of interest across different levels of the confounder. It's effectively doing what you would do by hand, but inside the regression model; I know regression models are a little more complicated than that, but that's the quick summary. By the way, if you've been keeping up with the exercises and have made your own DAG, you could consider adding another confounder if you can think of one. In the DAG we've made, we've got quite a few confounders: age, gender, SES, genetic factors; very common confounders that you almost always have to control for in observational health research.

The next question that often comes up is: what if I haven't measured my confounder? This is in fact what someone in chat asked about: what if your confounder is unknown, or you don't have a valid measure of it? This is a common issue in observational studies; we can't measure every possible variable. There could be confounders that we missed, ones we're not aware of, or ones that are very difficult to assess for whatever reason. Some variables, like certain biological variables, may be very intrusive to collect; if it's more of a behavioral measure, perhaps it's difficult to ask a person to remember, or it might be very personal. Unmeasured confounding is a missing data problem: we cannot adjust, deconfound, or stratify for a confounder we have not measured. People in previous years have asked me, okay, well, what do I do if there's an unmeasured confounder? I think the best solution is sensitivity analysis, to see how large in magnitude the unmeasured confounder would need to be to change the observed effect. That's like what I mentioned with Fisher's hypothesis about the genetic
factor, where it would need to be nine times the effect to actually introduce that much bias. So you would do some form of sensitivity analysis: you would try different magnitudes of confounding to see how they affect what you have observed, and if the effect size is very sensitive to this hypothetical confounder, then you can interpret your results in light of that and include it as a limitation. There are some other, more advanced options out there that you can look up if you're interested; I'm not going to go into too much detail.

Now we're going to get into a stickier issue: colliders. The thing about colliders is that they seem really similar to confounders. If you look at this DAG and you squint a little and don't think about the arrows too much, it looks exactly like a confounder, right? But it's not; in fact, it's kind of the opposite. A collider is not a common cause of the exposure and the outcome like a confounder is; a collider is a common effect. The outcome causes the collider, the exposure causes the collider, and the collider, again, is not a mediator. To give an example with our ambulance and dog barking: a collider might be sleep disruption. Both the ambulance and the dog barking wake you up, but you waking up is not causing the ambulance to be there, and it's not causing the dog to bark, to be very clear. If we put the two DAGs side by side (they're a little cut off, but the important part is visible), in the confounder DAG the arrows point toward the exposure and the outcome, while in the collider DAG the arrows point toward the collider, away from the exposure and the outcome. Why is it so important to know the difference? Well, if we adjust for a collider, if we try to deconfound for a collider like we would for a confounder, we can actually introduce bias into our estimate. In effect, by adjusting for a collider, we create a backdoor pathway that wasn't there. With our collider, we would effectively open the pathway between the ambulance, sleep disruption, and dog barking, which is otherwise closed. So what's the proper way of dealing with colliders? You don't include them in your model. Hypothesized colliders should not be adjusted for, included, or deconfounded for; you can still draw them in your DAG if you want to be comprehensive.

A side note on pathways: when I say a pathway is blocked, what does that mean, and how do we know a pathway is blocked? A pathway is blocked when two arrows collide with each other. In our example with two confounders, C1 and C2 on the same pathway, at no point is the pathway blocked by arrows hitting each other, colliding; but in our collider example, the two arrows collide at C2, and that means the pathway is blocked, unless we adjust for it, in which case we would accidentally open a backdoor pathway. So, if you're following along, you can add a collider to your DAG, and let's think of one here: what might be a collider between smoking and lung cancer, or perhaps one of these other variables? Yeah, chronic coughing, or some other condition; let's say "other health problem". You can see here, great, it's not red like the others, because we don't need to adjust for it; in fact, if we look up here at what dagitty says we need to adjust for, it lists age, gender, genetic factor, and SES, but it doesn't include the collider, because we don't need to adjust for it. If we go into ggdag and add it here real quick, we could say, of smoking... did I do that right? No, it just didn't include it; oh, because there needs to be a comma there, okay. So it made it look a little complicated, but I think
if we look at the adjustment set — yep, as it says here, you don't need to adjust for "other health condition." So DAGitty is telling us that we don't need to adjust for this collider. Okay, good question: what's the difference between a collider and a confounder? A confounder is a common cause of the exposure and the outcome: the confounder causes the exposure, and the confounder causes the outcome. Looking at our example, we could imagine that as people get older they're more likely to take up smoking — I don't think 12-year-olds are very likely to smoke, maybe more so back in the 1950s, but not as much today — and smoking is probably better imagined as a cumulative smoke exposure variable. And you're more likely to develop lung cancer just as you get older: more mutations in your cells, more accumulated exposures. So in that case age causes smoking and age causes lung cancer, and it is not a mediator, so it is a confounder. Now looking at a collider, our example of another health problem — a chronic cough, say, or any of a lot of other health conditions — smoking causes the other health problem, and lung cancer causes the other health problem. In this case the exposure and the outcome are both causing the other health problem, the collider. So the collider is a common effect of the exposure and the outcome; the confounder is a common cause of the exposure and the outcome. I hope that clarifies things; if not, I'm going to recommend a book at the end that can probably help if you're at all confused. So I want to get into one other issue that you probably remember being mentioned in the talk before lunch: selection bias. What actually is selection bias? I think we talk about it a lot, but it has a very concrete definition: it's when
we accidentally condition on a collider in a study design. The most classic example of this is the healthy worker bias, or survival bias. You see this in a lot of studies of coal mines and lung conditions: everyone who gets sick quits the coal mine, because if you're getting sick and you've got something in your lungs, you're probably not going to keep working in a coal mine very long — you're going to quit that job. So if I then do a study of people who work at a coal mine and look at their lungs, all the people who got really sick dropped out of this cohort of possible participants before I even started the study, and I have now accidentally conditioned on a collider — the way to look at it is that selection into the study is the collider in this example. So that is the classic example of selection bias. Another classic example comes from the second Iraq war, where they tried out a new tourniquet for helping soldiers who were wounded get from the battlefield to the hospital, and they found that these tourniquets were awful: people who had these tourniquets applied to them died at a much higher rate than people who didn't. Then someone re-analyzed the data and found that the people who were having the tourniquets applied to them were the ones for whom all other options had been exhausted. That's another example of where selection bias comes into play. Selection bias is really important, and it can be very difficult to figure out where you need to start your study, but it's important to ask: could people already have the condition of interest? Could they have already dropped out of my eligibility pool?
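The coal-mine story can be sketched as a tiny simulation in base R (all numbers and variable names here are invented for illustration): two independent factors each make people drop out before enrollment, and restricting the analysis to those who remain induces a spurious association between them.

```r
# Toy simulation of selection as a collider (hypothetical numbers).
# "exposure" and "frailty" are generated independently, but both make
# people drop out before the study starts.
set.seed(42)
n <- 10000
exposure <- rbinom(n, 1, 0.5)   # e.g., worked in the mine
frailty  <- rbinom(n, 1, 0.5)   # unrelated poor health
# Probability of remaining available for the study drops with both:
stayed <- rbinom(n, 1, plogis(2 - 2 * exposure - 2 * frailty))

# In the full population the two are (by construction) uncorrelated:
round(cor(exposure, frailty), 2)
# Among those who stayed — i.e., conditioning on the collider — a
# negative association appears even though neither causes the other:
round(cor(exposure[stayed == 1], frailty[stayed == 1]), 2)
```

That induced negative correlation in the selected subgroup is exactly the bias that conditioning on selection creates; neither variable causes the other at any point in the simulation.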
How am I selecting participants? I do have one professor who even argues that some studies should start by enrolling pregnant mothers and then following the children throughout their entire lives, so that you can really avoid any possible selection bias — because otherwise children could die from conditions early enough to cause selection bias. So you can go down this rabbit hole, but as Professor Macrigy mentioned, there are some ways of identifying selection bias and considering what to do about it, and she talked about that. So, looking at the collider we added as part of exercise three: could it be a possible source of selection bias? Let's go back to this — we added "another health problem." Could this be a source of selection bias? Any thoughts? If someone has another health problem that they got due to smoking and lung cancer, and we're doing a study of smoking and lung cancer, but they have some other health condition as well, could that be a source of selection bias? Yes — if we're excluding them, either intentionally, because maybe we have an exclusion criterion that says you can't have these other conditions, or because they died — unfortunately, people with lung cancer can die, and perhaps they didn't survive long enough to be in our study. And that means we might be excluding some of the most severely sick people and therefore biasing our estimate of the effect toward the null. That's really important to think about. So this is actually a pretty good example of a possible source of selection bias, but it also depends on how we design our study: if we start our study by enrolling current smokers who don't have any health conditions, perhaps this won't be an issue, but if we start by enrolling people who are in the
hospital for lung cancer already, then this might be an issue. So the other thing you have to think about is when you're doing enrollment in your study. For our next segment — we've got about half an hour left — we're going to go through some common study designs and their implications for causal inference. We'll start with RCTs, probably the design most people in bioinformatics are most familiar with: randomized controlled (or clinical) trials. The exposure is assigned at random, and because the exposure is assigned at random, there are no possible confounders of the relationship between the exposure and the outcome. If randomization was done correctly, there should be no possible confounders, because nothing can cause the exposure: if I assign whether you smoke using a coin flip, age cannot affect whether you're exposed to smoking. (This is ignoring treatment compliance, which is a whole other issue.) These RCTs, as you've probably heard, are the gold standard for causal inference. However, selection bias is still possible; as mentioned, it depends on when you enroll people. If we're enrolling cancer patients who are already in the hospital, perhaps the sickest of them have already passed away, and therefore we're selecting for less aggressive forms of the cancer. I've kind of just answered the question I was going to ask you — how can selection bias occur in an RCT? It's about who survives to be in the study, and whether there are other factors you've selected on that are excluding people. Another design you're probably familiar with is Mendelian randomization, which is called instrumental variable analysis in economics; epidemiology uses both names, but they're the same thing. This attempts to mimic an RCT using a source of
pseudo-randomization. If we look at the DAG for our RCT, we've got a coin flip that assigns the exposure, and therefore there can't be any possible confounders, because nothing else can cause the exposure. Likewise, in instrumental variable analysis or Mendelian randomization, we replace that coin flip with some kind of pseudo-randomizer: in Mendelian randomization it's some kind of genetic variant; in instrumental variable analysis it's sometimes a new government policy, or distance from a certain hospital. The classic example being used a ton right now is states that expanded Medicaid versus ones that didn't — that's the one all the epidemiologists use. There can be a lot of these, but the point is that it serves as a pseudo-randomizer and therefore, hopefully, rules out any confounders between the exposure and the outcome. So this sounds pretty good, right? We can use observational data but mimic an RCT. What might be the one main issue with IV and MR studies? Right — is it truly random? That's the question we have to ask. With a genetic variant, maybe we're more confident in that, but especially on the IV side of things, with government policies and the like, it's probably not as random: people might make choices, and distance from a medical center reflects whether people choose to live in cities or not, which could have socioeconomic factors behind it. But if you can do one of these, they're great. Getting into observational studies: cohort studies. You've probably heard of the Framingham Heart Study, or — I'm trying to think of the big studies — there's the one that enrolls all the nurses who
graduate from nursing school, and I think there's a similar one for doctors, where they follow up with them over time, over decades even. You select people on a common characteristic for these studies — occupation, location, or history of a particular disease are the most common — and then follow up with them over time so that exposures and outcomes can be assessed. But of course, since this is an observational study, there are possibilities for confounding, and selection bias can occur depending on when you enroll them. Likewise, case-control studies: if you do any microbiome work, you see a lot of case-control studies in the microbiome field right now, in part just because sequencing is expensive. So you select usually about 20 to 30 cases and 20 to 30 controls, and you try to match the cases to the controls — usually on socioeconomic status, age, sex, and other variables — trying to make your controls mimic the cases as much as possible. But the issue there, of course, is unmeasured confounding: you can't match on everything, there could be other variables that are important, and selection bias can still occur. Finally, I just want to mention GWAS, which are observational studies where participants are chosen based on a specific phenotype or disease; GWAS can be either case-control or cohort studies. And I'll just ask: what are some possible sources of confounding or bias in a GWAS? Ethnicity, yes — and socioeconomic status can definitely be tied up with race; that could introduce selection bias, yeah. Cool. Okay, so we're going to switch gears and work through an example together. I just want to ask if there are any questions before we do that. Cool — thank you, everyone who stuck this out; I know it's after lunch. There's more here of what I was just talking about, but I have some examples: you can adjust for
confounders in a lot of the common packages that we use in Bioconductor — DESeq2, edgeR, limma — and there are some examples here if you want to look at those. But I'm going to work through an example using data from curatedMetagenomicData. This is microbiome data that the Waldron lab offers; the dataset has 52 cases of colorectal cancer patients and 52 controls. Our first step is to create a causal diagram for this relationship, so let's do that in DAGitty. I should start by reading this: we are interested in whether our exposure, colorectal cancer, changes the outcome, the bacteria in the gut microbiome — whether it induces some kind of dysbiosis. So we'll create a DAG for this relationship: our exposure is colorectal cancer, our outcome is the gut microbiome. I'm going to ask some simple questions. Which way is my arrow going — from CRC to the gut microbiome, or from the gut microbiome? From colorectal cancer to gut microbiome, right. So, just starting simple: are there any things we believe could confound this relationship? What are some common ones we've talked about? Age, yes. Diet is a big one in gut microbiome studies — diet could be a huge factor, especially for colorectal cancer. Antibiotics — though do antibiotics cause someone to get colorectal cancer? And I should specify my exposure and my outcome. When you click on the variable — colorectal cancer — you go up here to the top left (I can make this a little bigger) and you can see I can specify my exposure and my outcome. So I've specified colorectal cancer as my exposure, and — for gut microbiome I just removed my arrow — I specified gut microbiome as my outcome. Once you've done that, it will start coloring things, because it needs to know what the exposure and the outcome of interest are. Any other possible confounders? Was that
something? Okay, well, we'll keep going. So I'm just going to say okay, I've made my DAG; I kind of have my theory on this relationship. This is probably very simplified — there are probably other factors, of course: socioeconomic status, gender, race, some kind of environmental exposure, some measure of stress. You can keep thinking of variables, probably a lot of them, but we'll keep this simple. And switching over — yeah, that's a good question. This is very simplified; it could be the opposite, that the gut microbiome causes colorectal cancer. That is something that has been explored and is very possible. You could also imagine a much more complicated situation, where the gut microbiome at time zero causes colorectal cancer, which then causes the gut microbiome at some time one in the future. So you'd have this much more complicated relationship where the gut microbiome at time zero affects whether they get the disease or not, and also affects their gut microbiome in the present. At that point you possibly have what's called time-varying confounding. There are methods for dealing with that which are well beyond the scope of this 90-minute workshop, but you can look them up — even if you just Google it you'll find a lot of examples. It's a subject of much discussion in the research; the ones I hear about the most are the g-methods, using g-estimation and g-computation. What those are and what they mean is kind of beyond the scope of this, but if you're
interested in that — if you think there's time-varying confounding — that is what you'll need to do. All of this assumes we've got very simple cross-sectional data; once you add a longitudinal element, things get much more complicated, of course. That was a good question. But we'll keep it simple, so I'm going to remove this — I think I just need to delete that. Okay, so we're going to be using curatedMetagenomicData. I have an example here where I added a few more variables; we can see my DAG here. So between colorectal cancer and the gut microbiome I added gender, age, diet, health status, and hospitalization — hospitalization being a collider in this example; the rest, I think, are confounders. And we can see my adjustment set: I don't need to adjust for hospitalization, it's a collider, but I should adjust for age, health status, gender, and diet — whether I have those data, though, I'm not quite sure. We can download the data from curatedMetagenomicData, which I'm not going to do again because it takes a moment, and we end up with a TreeSummarizedExperiment. One thing to note: we've got six NAs for whatever reason, so I removed those, and then we have 52 cases and 52 controls. From there we make a phyloseq object, and then I'm just going to run a very basic DESeq2 analysis using only study condition — were they cases or controls. This takes a moment on Orchestra; I was trying it out earlier and it kind of takes a moment. I do think we get something like 175 significant results, which is a little weird — I need to look at these data a little more, because I'm not sure that should be right. I'm a little skeptical that there are that many significant results when you only have 52 cases and 52 controls, and I'm using adjusted p-values, so it's something to look at. Here we go, so we can see
these are the significant bacteria using the adjusted p-value. We have not adjusted for anything — this is just looking at study condition — but of course we want to adjust for things, so we'll add some of the variables that we're interested in. In DESeq2 it's very easy: when we specify the design in our phyloseq-to-DESeq2 step, we just add age category and gender to it. I don't think there were data for health status and diet in curatedMetagenomicData for this study, unfortunately. Then we can look at the results here. There were 172 that were significant in our study-condition-only model; now if we add age and gender, we'll see how it changes — and ideally it shouldn't change much, right? We're just adjusting for two variables; hopefully my results are not radically different. I think I'd be worried if I got super different results, but they should be slightly different, because we've adjusted for two confounders according to our model. This takes a moment — maybe it's faster on my laptop — but we can just look at the results here. We get 175 once we adjust for age, and if we add in gender we get 183 significant results, so it does change them slightly. So, after adjusting for confounding by age and gender, we found that colorectal cancer is associated with increased abundance of — this says two, but it should be 183 — taxa. But we would probably still hesitate to argue that what we've shown is causal. What sources of bias might still potentially affect what we've shown here? That's my first question for you. Other confounders, right — selection bias, unknown confounders, things that we didn't include in our DAG. And how could we better design a study to account for these sources of bias? That's a great answer: talk to subject matter experts, people who know colorectal cancer — talk to them and see what they say. Great
answer. Okay, so I've got two discussion questions; we've got about 13 minutes left. In my work, at least, we are usually presented with data that have already been collected and asked to analyze them, but sometimes we're lucky enough to design studies from scratch — and sometimes you're sent a paper saying "we need to submit this in a week" and asked to give feedback. When during the study process should we create a DAG, and is it too late to make a DAG if the analyses are already completed or nearly completed? I don't have a right answer, so if anyone has any thoughts — yes, I definitely think the earlier the better. That's a good answer, because I was once sent a paper and told, "We're submitting this in 72 hours; just check the stats and make sure nothing's egregiously wrong" — because that does happen, right? And I made a DAG and sent it to them and said, well, you adjusted for the wrong variables: this is probably a collider, and maybe you should adjust for this instead. So I don't think it's ever too late to make one, but how useful it can be varies: if you do it early on, it can affect what data you collect and maybe help you measure some confounders that would otherwise go unmeasured. And then one subject that I didn't touch on in this workshop is generalizability, or transportability, of causal findings: taking the findings from our study and applying them to another population. What issues might arise when doing so, and is there a way we could use a DAG to resolve those issues? Yeah, exactly — the confounding structure could be very different in different populations. Thinking about colorectal cancer and the gut microbiome, diet might be very different when you're looking at a European population versus a population in, say, Japan or China. So the confounding structure could be completely
different. Any other thoughts? [audience question, partly inaudible] Right, that's a good question: how do you get people who maybe aren't bioinformaticians or epidemiologists to start thinking about causal inference more holistically — to think about causes as being part of a complicated network of things, not just a single cause? I mean, one piece of jargon that I didn't mention is the idea of necessary and sufficient causes. I'm trying to think of the simplest way to put this: everything has multiple causes. Take a health condition like lung cancer — if everyone who smoked got lung cancer, then smoking would be the one cause, but that's not true. We know that some people who smoke don't get lung cancer, and some people who don't smoke do get it, so obviously smoking is not necessary to get lung cancer; but for some people, smoking even a little bit can be sufficient to give them lung cancer. So we can think about sufficient causes — a set of causes that, when combined, result in the outcome — and necessary causes, causes that are required. I think the classic necessary cause for most health conditions is that you need to be alive: you can't really develop diabetes if you're dead. So that's a very simple one, but there's other stuff: to develop COVID-19, you have to have the virus that causes COVID-19 in your body, so that is a necessary condition. It's not necessarily sufficient, though — there are probably people the virus enters who do not develop COVID-19. So it's not sufficient, but it is necessary. That's a bit of jargon I didn't talk about, but that is the point: think about causation as sets of variables and larger groupings of variables, not just one variable.
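The necessary-and-sufficient idea can be written down as a small base-R sketch, in the spirit of the sufficient-component cause model (the component causes and their values here are entirely hypothetical): the outcome occurs when any one sufficient set of components is complete, and a necessary cause is one that appears in every sufficient set.

```r
# Hypothetical component causes; TRUE means present for this person.
alive       <- TRUE    # necessary: a member of every sufficient set
virus       <- TRUE    # necessary for COVID-19 in the example above
high_dose   <- TRUE    # made-up extra component of one pathway
no_immunity <- TRUE    # made-up component of a second pathway

# Two hypothetical sufficient sets; either one completing is enough.
sufficient_a <- alive & virus & high_dose
sufficient_b <- alive & virus & no_immunity
outcome <- sufficient_a | sufficient_b

# Removing a necessary component (the virus) blocks every pathway:
outcome_no_virus <- (alive & FALSE & high_dose) | (alive & FALSE & no_immunity)
c(outcome, outcome_no_virus)   # outcome TRUE, outcome_no_virus FALSE
```

The point of the sketch is only the logical structure: "sufficient" is an AND within a set, "the outcome" is an OR across sets, and "necessary" means setting that one component to FALSE turns every sufficient set FALSE.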
Unfortunately, the way we design experiments is often driven by being interested in one thing: we want to see, does smoking cause lung cancer? Does vitamin D reduce your risk of osteoporosis? Things like that. So that's kind of the tension here. We have about seven minutes left, so I just want to wrap up. I wish I could come up here today and say, "Here's how you deal with causal inference in Bioconductor — here's a package, just download it and run this function and you're good to go." Unfortunately, I can't. Instead, I got up here and told you that you've got to make these decisions on your own. We as researchers have to consider possible sources of bias, ideally as we begin designing our studies, but at least as early as we can in the study process. And DAGs are really the take-home point here: if we can draw them out and consider what assumptions our DAGs are making, then we are taking a very important step toward considering the causal relationships among our variables of interest. There's an epidemiologist, Miguel Hernán, who says we use DAGs so that we can draw our assumptions before we draw our conclusions. Unmeasured confounding is a difficult problem for causal inference in observational studies, and there is no easy solution. Finally, we just need to remember that our definition of cause is rooted in the counterfactual definition, and we need to find a way to come as close as possible to observing what would have happened if the participant's exposure status had been different — as if we had that time machine. So, to mention some fun reading materials: my favorite here is Pearl and Mackenzie. Judea Pearl pretty much literally wrote the book on causal inference, but it is a very difficult book to read. I do recommend it, here on
the third bullet point — but he's a computer scientist, it's full of some very fancy mathematics, and it is very dense. He also worked with a science journalist to write a more accessible book — kind of a pop-science book, though a little more technical than the average one — called The Book of Why: The New Science of Cause and Effect. It's a great summer beach read: it's hot out, take it to the beach and learn about causal inference. If you're interested in a slightly more academic text, Miguel Hernán and James Robins have written a textbook called Causal Inference, which is free online — you can just download it, and it comes with R code to follow along with the examples. And finally, Miguel Hernán also has an edX course that is free (you can buy the certificate, but the course is free) that teaches you all about DAGs. So thank you all for attending and sitting through this, and if there are any questions I'd be happy to answer them. Okay, I've got some questions in the chat. "In MR, can you explain how to use genetic variants as the coin flip, please?" So, the genetic variants — you find a variant that is hopefully randomly distributed. I'm not going to explain this well, because I've never actually done Mendelian randomization, but I think the idea is that you find a genetic variant that is associated with the exposure but is otherwise essentially randomly distributed in the population. I apologize to anyone who knows a lot more about Mendelian randomization if I'm butchering it, but there are a lot of great resources out there if you just Google "Mendelian randomization," and I think there's a great program called MR-Base — it's
become an R package, but it's also got a web app that kind of walks you through step by step, and a lot of people just do their Mendelian randomization using it or using the accompanying R package. It has data from a lot of GWAS databases, so you can just pull in the data; it's really simple to use. I hope that's an informative answer. Ryan Thompson says it's still useful to create a DAG late in the analysis, because it will help identify issues that are likely to be raised by reviewers, so that you can prepare to address those criticisms. Yeah, that's a really good point: reviewers will ask why you didn't include this variable, or whether you thought about that, and if you have a DAG ready you can hopefully anticipate that — or sometimes you can even respond and say, well, according to our DAG we don't really think that's a confounder (though you probably should still do it anyway, because it's the reviewer asking). Any other questions or comments? We've got about one minute left. Yeah, I think you could do that — you could say, "under the assumptions of this DAG, here's what I find." It could be almost a form of sensitivity analysis: depending on how I imagine this relationship — ideally, I think, you come up with one canonical DAG that you think is the right one, but maybe you test some others to see how sensitive the results are to changes in the DAG. You could definitely do something like that; it might be interesting. And yes, every DAG is making assumptions: the assumptions here are that age is a confounder, that diet is a confounder, that antibiotics are not a confounder. So there are a lot of assumptions embedded in here that are just shown in the DAG — we don't have to spell them out, because they're all in this diagram. And you could add unmeasured confounders in here, or
you could say — I've seen some DAGs where they just have a variable U to represent unmeasured confounding. I think we're out of time, but if you have any questions, my email address is on here — yep, it's on here — so feel free to reach out. I wrote a bunch of responses to questions I got via email after the 2020 workshop, so if you have any questions, please feel free, and I'll add them there. Thank you so much, everyone.