I'd like to thank the organizers for giving me the opportunity to come and talk to you guys about what we've been thinking about in my lab. So what I'm going to talk about primarily is stuff that's going on. So all of this is unpublished. Feel free to think about it, share it, whatever, but it's very much work in progress. Some of it's hot off the press, so do take it with a pinch of salt. So what we think about a lot in my lab is autoimmune diseases. And we kind of want to think about which genes go wrong in disease, and we think about dysregulated genes, but actually what we're interested in are the causal genes. And my pointer doesn't work, I can use this pointer, it's all coming up, Chris, today. So we're thinking more about causality than anything else. So when we say dysregulation, we're interested in pathogenesis, right? That's ultimately what we're after. And so just a 30,000-foot view of the immune system, if you remember: you start with a stem cell, you have two major lineages in the immune system, the lymphoid and the myeloid lineages. So things like macrophages are all the way down here and your T cells and B cells are down here. You think of them as adaptive versus innate. And what happens is every now and then this goes wrong. So the immune system's primary function is to protect the body from things that are foreign. And so it's got this amazing capacity to tell the difference between your cells and the rest of the world. And it's really good at this, but occasionally it screws up. And what happens is that it starts attacking certain tissues. So if it doesn't like myelin, you get multiple sclerosis. The immune system manages to go into the brain and attack the myelin sheath around neurons very specifically, chew it up, and you get lesions in your brain. You can get things like skin attacks, which give you Sjögren's syndrome, scleroderma. You can get type 1 diabetes, which we now know is an immune disease.
If it doesn't like aspects of the GI tract, you wind up with Crohn's disease, ulcerative colitis, or celiac disease, if it doesn't like the epithelia. Joint-specific dislikes, shall we say, give you rheumatoid arthritis or ankylosing spondylitis. And if it just doesn't like DNA, if it doesn't like nucleic acid, it attacks everything and you wind up with something called lupus. What's really interesting is that these are very, very specific dislikes. So MS is not rheumatoid arthritis. It's a very specific attack against myelin. It's not a specific attack against anything else. And what we really want to understand is what these diseases are. So something's going wrong with the immune system. We don't really understand what it is. What we do know is that all of these diseases are common. They're complex genetic diseases. There's a large portion of heritability. They track in families, but they're not Mendelian. It's not one catastrophic mutation. And of course, as GWAS came along, and I'm going to talk about multiple sclerosis, which is something that I work on, but you can take this as read for any immune disease. As GWAS came along, we hadn't really gotten a lot of traction on the genetics of these diseases. And then we barely managed to identify two loci in the genome in one of the first GWAS studies. Then a little while later, we managed to get another one. A meta-analysis of these two sets of studies from international consortia gave six new hits. And we're starting to climb this power curve of discovery. Then a further meta-analysis with more markers and a few more samples gave us an additional three new hits. Even more samples gave us another 25 new hits. The Immunochip gave us 47. That took us up to 100. And in our current studies, which are about 16,000 cases, 26,000 controls, and replication in another 36,000 samples, we've got another 100-odd new hits. So we're standing at around 200 loci right now in GWAS, right?
That explains, including the HLA, about 55% of the heritability. We estimate that in the common space, there's probably another 600 to 800 loci that we don't know about yet. We kind of do know about them. They're not genome-wide significant yet, but we know they're there. And we know the approximate complexity of the disease is about 1,000 independent variants. And so when ENCODE came along, we were a very small part of this paper from John Stam, showing that in Crohn's disease and in multiple sclerosis, there is strong enrichment of the risk SNPs in regulatory regions active in very specific subsets of immune cells. In multiple sclerosis in particular, you can see CD3 cells, CD19s, B lymphocytes, and CD14s, which is interesting. There's a lot of pathogenesis coming out of T cells as well, but these are more B cell-like ends. So dysregulation in multiple sets of immune cells seems to be an issue here, but this kind of sends us chasing down this idea that is now extremely common. And this is one of the great things, right? So 10 years ago GWAS wasn't going to work. And five years ago, everyone was asking why we haven't solved disease yet. Five years ago, everything was coding and now everything's regulatory and it seems really obvious. But even two, three years ago, this was not that obvious. And so this starts us chasing down this rabbit hole of which genes are getting dysregulated and how does that cause disease? And so that's what we're going to talk about today. There's further evidence that in specific immune cells, you get dysregulation that maps onto specific transcription factor binding sites. This is from Kyle Farh and Brad Bernstein, showing that the MS SNPs are particularly enriched for NF-Kappa B transcription factor ChIP peaks, for instance.
And so there's something that's fairly specific dysregulation in immune cells, which is great in bulk, but hard when you actually want to identify specific effects on specific genes in specific cells. And so that's the task at hand. And so when you look at some of these loci, you know, you put up a GWAS locus, here's a classic locus in MS. Well, there's NF-Kappa B1 and mannose binding protein A. And you could sort of make a case for mannose binding protein A, but really, everyone's going to assume that NF-Kappa B1 is the appropriate gene. And it turns out that that's right for various reasons, and so you can start working on that because you're reasonably sure that's the gene. When you look at another locus, of course, that gets a lot more difficult. You've got this big association peak, there's a bunch of genes in here, and the problem isn't that they're not good candidates. There's a bunch of good candidates in here. ORMDL3 is here, IKZF3, which is Helios, which is a transcription factor that controls T-regulatory cell differentiation, is there, plus a bunch of other immune genes, and so you're kind of going, what's going on here? So we kind of thought, okay, if there is regulation and we have SNPs, how do we unite the genetics with the epigenomics? And a lot of people are thinking about this. You're going to hear a lot more stories about this. You've already heard some. Here's how we've been thinking about it. So we're kind of amateur math geeks, and so we think about how we can transfer some of this probability and do some functional fine mapping. So you have a set of SNPs in the genome. We're going to talk about DNase hypersensitive sites (DHSs) now, but instead of DHS, you can think of any regulatory mark. We've been working along with hypersensitive sites, because we like them. They're stable. They're nice. They tell you a lot. We're going to expand this to the other sets, but think about DHS for now. And you've got a gene in the locus.
So this is my, like, tiered view of a locus. So each of these guys is associated to disease. Ah, oh, this is going to chop off my things. Oh, well. So what that says is posterior probability of association, or PPA, okay? So when you do a GWAS, for each of these SNPs, you get a p-value for whether it's associated to disease or not. You can convert that simple p-value into basically a posterior probability, which tells you how likely it is that this SNP is the one driving the signal, okay? We're not going to talk about the math magic that underlies that. I'll bore you with it in person over coffee, if you like. But basically, for each of these SNPs, you can do a magical transformation and get the probability that that's the SNP that's driving the signal. If it's very associated and nothing else is associated, it's going to be really probable that it drives the signal. If there's a whole bunch of SNPs that are equally associated, you're going to have to spread the probability that it's causal over all of those guys, right? That's the intuition here. So of course, some of these SNPs are actually on DHSs. And so you can transfer that probability (I can't even talk anymore, sorry) to the DHS. You could also do something fancy like, say, this guy is about this far away from this DHS, so I'm going to give it some proportion here. We're not doing that right now. But basically what I can do is come up with a way to score every regulatory region for its probability of explaining the association in that region, right? Of course not every SNP is on a DHS. But if I sum all of these posteriors, that gives me the global probability that in this locus, association is mediated by these regulatory regions. It doesn't have to be all of it. But if most of the signal is on DHSs, then you're going to get a high percentage, right? It's going to be close to one.
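The talk deliberately skips the math behind the PPA, so the following is only a minimal sketch of one common way to do it, not necessarily what the lab does. It assumes Wakefield-style approximate Bayes factors, exactly one causal SNP per locus, and equal priors across SNPs. The function names, SNP names, effect sizes, positions, and DHS intervals are all hypothetical.

```python
import math


def approx_bayes_factor(beta, se, prior_var=0.04):
    """Approximate Bayes factor in favour of association (Wakefield-style).

    prior_var is an assumed prior variance on the effect size.
    """
    v = se ** 2
    shrink = prior_var / (prior_var + v)
    z2 = (beta / se) ** 2
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z2 * shrink)


def posterior_probabilities(snps):
    """Normalise Bayes factors so the PPAs sum to one across the locus.

    Assumes one causal SNP in the locus and equal priors per SNP.
    """
    abf = {name: approx_bayes_factor(beta, se) for name, (beta, se) in snps.items()}
    total = sum(abf.values())
    return {name: value / total for name, value in abf.items()}


def dhs_posteriors(ppa, positions, dhs_intervals):
    """Give each DHS the summed PPA of the SNPs that fall inside it."""
    scores = {dhs: 0.0 for dhs in dhs_intervals}
    for snp, prob in ppa.items():
        for dhs, (start, end) in dhs_intervals.items():
            if start <= positions[snp] < end:
                scores[dhs] += prob
                break  # the toy intervals do not overlap
    return scores


# Toy locus: (effect size, standard error) per SNP, all values made up.
snps = {"rs_a": (0.30, 0.05), "rs_b": (0.28, 0.05),
        "rs_c": (0.10, 0.05), "rs_d": (0.02, 0.05)}
positions = {"rs_a": 1050, "rs_b": 1900, "rs_c": 2500, "rs_d": 4000}
dhs_intervals = {"dhs1": (1000, 1200), "dhs2": (2400, 2600)}

ppa = posterior_probabilities(snps)
dhs_ppa = dhs_posteriors(ppa, positions, dhs_intervals)
# Share of the locus signal that sits on any DHS.
locus_fraction = sum(dhs_ppa.values())
```

In this toy locus, two similarly associated SNPs split most of the probability between them, and only the one that sits on a DHS contributes to `locus_fraction`.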
If it doesn't look like it's being mediated by regulatory regions, you're going to get a low proportion. So, so much is easy. What's cool is you can then think about how you correlate these guys to the genes they control. So if I had a magic way of saying, well, this DHS is correlated to this gene this much, then I can weight how much of the posterior of association gets transferred into this gene, right? So if this guy's perfectly correlated, if this is what determines whether this gene is expressed, then if this explains all of the association to a trait, then presumably it's acting on this gene. Because the DHS isn't just a DHS, it's regulating something, right? So that's the intuition. And you can partition everything this way. And what it says here is CP times PPA, okay? So that's just the correlation posterior between this DHS and this gene times how much weight you've given it from the association data. And that way you wind up building this model of this gene posterior. So if I sum all of these, all of the contributions of each DHS from the SNPs going into this gene, I can get a sense of the probability that this gene is driving association in this region. And I can do that for any gene. So I now derive a score, basically, for how likely this gene is to be pathogenic if that pathogenesis is mediated by DHS regions. And we know they're enriched, so that's a reasonable hypothesis, okay? It's not the only way to do it, but it's one way to think about this. And so you have to solve a couple of technical problems to do this. One is you've got to correlate your DHSs to your genes. And so that's really simple. You just observe if there's a peak and what the level of expression of a gene is, and then you do a correlation, on/off versus level of expression of a gene. And you do that for each DHS you find. Two issues. First of all, you've got to decide what the same DHS is.
And secondly, you need measurements where you've measured both DHS and gene expression, okay? So to do the first thing, we use an alignment approach. This is what real DHS data looks like out of Hotspot. These are peaks. This is an arbitrary part of the genome, and your job is to figure out which ones of these represent the same element across samples. We're not terribly good at that as human beings. Fortunately, computers are a lot better at this than we are. So you can put in a clustering approach and kind of decide that these look the same: they're a little jittered, but they kind of look similar. And then these guys are kind of the same, but you're maybe a little bit less confident because there's more spread. And these guys are kind of the same as well, but there's even more spread. Okay, and the way we do this is with Markov clustering. It's a way to cluster stuff. There are other ways to do it. It works reasonably well. And the way you think about this, and that's gotten chopped off as well. That's brilliant. Okay, so one way you might want to do this is to say, is this detectable? And so you go into the Roadmap data. Fortunately, there are replicates. And here's my assertion. If I see a peak here in replicate one of a tissue, then I should expect to see that peak in replicate two of a tissue as well, right? Biological replication, just like we do in any other experiment. Really simple. And so once I decide this is my cluster, that's what comes out of the algorithm, you don't just go and apply that mindlessly to data. That's not how you do analysis, right? You check and you see what you can detect. And of course, the wider and the sloppier this peak is, the less likely it is to be true. And so you can do a statistical test. And so once you've decided what the cluster is, if there's a peak anywhere in that cluster, you mark that sample as a one. And if there's no peak, you mark it as a zero.
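The one/zero marking just described can be sketched as follows. This is only an illustration under assumptions: the cluster is treated as a single half-open genomic interval, a peak counts if it overlaps that interval at all, and the sample names and coordinates are hypothetical.

```python
def mark_presence(cluster, peaks_by_sample):
    """Return 1 for each sample with any peak overlapping the cluster, else 0."""
    c_start, c_end = cluster
    return {
        sample: int(any(start < c_end and end > c_start for start, end in peaks))
        for sample, peaks in peaks_by_sample.items()
    }


# Toy data: one cluster and the peak calls from four made-up samples.
cluster = (1000, 1300)
peaks_by_sample = {
    "cd4_rep1": [(1020, 1180)],
    "cd4_rep2": [(1100, 1290), (5000, 5200)],
    "cd14_rep1": [(4000, 4150)],
    "cd14_rep2": [],
}
calls = mark_presence(cluster, peaks_by_sample)
```

Here both CD4 samples are marked one and both CD14 samples are marked zero.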
If you have replicates, where the labels are somewhere over here on that wall, you can then say, okay, if I see ones in both replicates, I'm going to score that tissue as a two. I'm going to score it as a one if there's only one replicate, so if it's discordant. I'm going to score it as a zero if there's none there. And then you can do a test. So I've done this without knowing about replicates. And then I add the information about what goes with what. And I ask, are they consistent? So if I get things like, look, in cell type one, I get a one. And in two, I get a one. I get all ones. That suggests this isn't consistent. It's not replicating. And if I get a lot of twos and a lot of zeros and very few ones, that looks consistent. So it's replicating. It's either not there or it's there. And so I can do a statistical test. It's not terribly important what the test is. It's a simple chi-square approximation. We do this over 57 tissue replicates from Roadmap. And we find that just feeding this in when we cluster, we can get about a million out of 1.99 million. So about 54% of our clusters pass a fairly stringent threshold, a fairly lenient threshold. And that's because very often these things are kind of diffuse. The clusters don't look really good. And so we're probably not doing great at the clustering. And it's unreliable. There's also a bunch of singletons in these data that get thrown out because they don't replicate. But most of this is actually the clustering. So we can get about a million features out of the genome. And we don't worry about recovering more stuff and improving the clustering. Right now, we're just working with these million. So the other thing is you've got relatively low power. And so what's nice about this, which you can clearly read here, is that you can estimate how much of the heritability you're still explaining. So this is just a sanity check. If you use all of these clusters, it's about 14% of the genome.
And it explains a proportion of heritability. And what I want to know is, if I reduce this to the half of the clusters that I'm using now, what proportion of heritability am I still explaining? And to a first approximation, what you can see here in red is all the peaks. And in blue is just the clusters that we define. Pretty much we're capturing all of the signal. There's wiggle room. There's a little bit of error on these things. But we're capturing just about all of the heritability. But we've gone from 14% of the genome to 8% of the genome. So rather than do the 500 base pairs either side, which is what most of the previous heritability estimates have done: a lot of the summary papers have kind of shown, oh, there's enrichment in DHS or in regulatory regions or whatever, but they actually bracket each feature by 500 bases and so they cover 50% of the genome. So yes, all of the heritability is explained by 50% of the genome. I'm telling you that a lot of the heritability is explained by 8% of the genome. So it's a little bit more specific. And so the second challenge, now that we've decided what the clusters are, is to correlate them to gene expression. So you need matched data. We use 22 sets of matched DHS and exon array data from Roadmap again. And the problem is there's massive inflation because gene expression data, of course, is highly correlated. And so you just get this massive inflation in the expected distribution of these tests, and we can correct this. We just go through and normalize it, and basically you start off with this massive inflation. I'm showing you lambda here. It's supposed to be a nice straight line here and we can correct all of that out. So now that we have all of these statistics, we can go back and do our little approach. So now we have this part, we already have this part from credible-set mapping and posterior estimation, and we can now estimate gene-wise scores.
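The gene-wise score described above, the sum over DHSs of CP times PPA, can be sketched like this. It is a minimal illustration, assuming the per-DHS PPA and the DHS-to-gene correlation posterior (CP) have already been computed and both lie between zero and one. The function name, gene names, and all numbers are hypothetical.

```python
def gene_scores(dhs_ppa, cp):
    """Sum CP x PPA over every DHS linked to each gene."""
    scores = {}
    for (dhs, gene), weight in cp.items():
        scores[gene] = scores.get(gene, 0.0) + weight * dhs_ppa.get(dhs, 0.0)
    return scores


# Toy inputs: PPA already transferred onto two DHSs (made-up values).
dhs_ppa = {"dhs1": 0.6, "dhs2": 0.3}
# Correlation posterior for each (DHS, gene) pair (made-up values).
cp = {
    ("dhs1", "geneA"): 0.9,
    ("dhs1", "geneB"): 0.1,
    ("dhs2", "geneA"): 0.2,
    ("dhs2", "geneC"): 0.5,
}

scores = gene_scores(dhs_ppa, cp)
# Rank genes from most to least likely to carry the regional signal.
ranking = sorted(scores, key=scores.get, reverse=True)
```

A DHS only passes its weight to genes it is correlated with, so a DHS with no link to a gene contributes nothing to that gene's score.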
And so the big red exclamation point here, which you can't see, means this is really fresh, as in last Friday's results, hot off the presses. Here's a region. We're talking about MS GWAS. This is actually the Immunochip data from 2013. Chromosome six, a one-megabase region, and I'm doing this for all of the genes in the region. DHSs explain 94.5% of the signal in the region. So whatever it is, it is really, really likely to be acting through a DHS, right? MDN1, which is one of the genes in the region, explains 55.5%, not of this 94% but of the total signal in the region. That's how it feeds through, right? So that's what I'm doing. So BACH2 is 16%. Between these two genes, you'd be hard pressed to say that any of these other genes are really pushing the signal, but it's probably this one. So this is a way to prioritize genes based on regulatory potential. Now it's really important to look at this number as well. If this number is low, you kind of think, well, it's not really likely to be regulatory in the first place. If it is, it's gonna be one of these guys, but it's not likely to be. In this case, it's really likely to be, right? So if you look at another region, this region that I showed you before, the IKZF3/ORMDL3 region, brilliant, they've chopped off. Okay, that reads 0.029, 0.022, and they're ranked and it goes down from there, okay? So you'll see that in this region, about 30% of the association signal is explained by DHS clusters as we've defined them, okay? So it's not a lot of it. That 30% is now basically smeared over a whole bunch of genes. There's no one gene that explains that signal. So even if you accept that I'm willing to take this 30% as a gamble, there's no one obvious gene you look at. And the reason for that is actually that we suspect that what's going on in this region is there's an entire element, something like an accessibility element. Some people call them super enhancers; they can mean different things to different folks.
What we suspect is going on is there's an element that sets whether the entire locus is accessible or not accessible, and that affects the transcription of multiple genes. And so what the effect may be is actually that you're changing whether this entire locus is available or not available. And there's a whole bunch of genes in there that then do different things and set a risk state or multiple risk states. And so sadly it's not always one locus, one gene, but these are probably gonna be really interesting. It's unclear whether we can solve such loci, but they're gonna be really interesting. So you're gonna get examples like this, and this approach won't work all the time, because not all loci are simple one-gene things. So we're gonna have to think harder about these. So I'm gonna switch gears in the dying moments and just give you another flavor of how we're thinking about the other way around of epigenomics. So so far we've talked about how to analyze these data and make inferences so we can then go and work on certain genes. But what we really wanna know at some stage, if changes to gene regulation are what is creating disease states in the immune system, well, you're not born with an immune disease. Most of these diseases occur in the third, even fourth decade of life. So what's the risk state in immune cells that predisposes you to disease? That's a hard question, that's a really hard question. So I told you before that you can see in multiple sclerosis a fairly strong enrichment in NF-Kappa B binding sites that are near associated SNPs. Okay, so there seems to be something about NF-Kappa B. And I also told you that there's an NF-Kappa B1 locus that harbors a very strong association. So when you look at MS patients versus controls, if you look at CD4 cells, you find that in response to stimulus, in response to TNF-alpha, ex vivo CD4 cells actually signal much more strongly through NF-Kappa B.
And this is a measure of phosphorylation of p65, which is one of the NF-Kappa B subunits, okay? If you look at how inducible CD4 cells are, how easy it is to activate CD4 cells from MS patients versus controls, you find that these are controls, the black circles, the filled ones are MS patients. You find that in general, the CD4 cells are easier to activate through NF-Kappa B. You can just hit them and they'll go. Correlation, not causation. This could be an epiphenomenon of disease state. And so what we did is we took this NF-Kappa B1 locus and we stratified people by genotype there. There's no implication of causality for the SNP we used. It's actually one of these really long haplotypes. It's identical. We just used it to stratify risk, non-risk. And we're looking at opposite homozygotes. And so when you look at the three genotype classes, without stimulation, this is your baseline, I'm sorry, it's chopped off again, but this is your baseline I-Kappa B degradation. So I-Kappa B gets degraded when NF-Kappa B signaling starts, right? And what you see is a baseline, this is 100%, and by genotype, you find that there's a difference in how strong I-Kappa B degradation is, suggesting that there's different amounts of signaling going on in these cells. If you do the obverse and look at the phosphorylation of the p65 subunit, again, you see the same sort of thing: this GG, which is the risk state, overphosphorylates compared to the other genotypes, suggesting that there's more signaling through NF-Kappa B happening per unit activation. That's kind of interesting, but actually, if you look at the expression by Western, so this is protein expression, if you try and quantitate how much p50, which is an NF-Kappa B subunit, you're seeing, you see like a 20-fold increase in how much p50 exists, just at baseline in these cells.
What's really interesting is that after activation, if you measure nuclear localization of phosphorylated NF-Kappa B, you see that there's about a three-fold change, with the GG risk homozygote putting a lot more phosphorylated NF-Kappa B into the nucleus following stimulus in CD4 cells compared to the A. And so what it looks like is for a given dose of stimulus, if you have the risk genotype, you signal a lot more, which probably does two things. It probably decreases the activation threshold required to kick these cells over into an activated state. And it may also smear the phenotype that you see, because there's so much transcription factor going into the nucleus that it's activating everything. And we'll talk about that in a second. This is not quite as simple as just a single effect at the NF-Kappa B1 locus. If you look at the TNF receptor, there are two subunits, and there's a variant in the first subunit, TNF-R1, now called TNFRSF1A. There's a coding variant where if you hit cells with TNF-alpha, you get different amounts of signaling through the TNF receptor, which leads to different amounts of phosphorylation of NF-Kappa B. Again, you're getting different amounts of signaling. I won't belabor this. It also turns out there's a whole bunch of other genes in the MS risk loci that are directly related to NF-Kappa B signaling. And so I suspect one of the things that's happening here is you're getting this global effect on NF-Kappa B signaling. It's not as simple as just a linear effect, but there are multiple things that feed into NF-Kappa B signaling, at least in CD4 cells, maybe in multiple other subsets, that are really setting the rheostat of how the immune system responds, and maybe that's partly how risk is determined. And so, oh great, these are chopped off as well. So here's the model, right? With external stimulus, you get phosphorylation of NF-Kappa B.
NF-Kappa B translocates to the nucleus, and it does what it does best. It activates a bunch of its targets. It activates their transcription, and that leads to activation, proliferation, and survival of these cells. Here's what happens when you change this. If you increase phosphorylation, you're gonna get more NF-Kappa B going into the nucleus. That's probably gonna activate its targets more easily. There's probably spillover, right? NF-Kappa B only activates a subset of its targets in any one given cell, or cell type. It's got a bunch of other targets, which it doesn't activate because the cofactors aren't there. Transcriptional activation is a multi-cofactor process. If you've got enough excess, even though the kinetics of these promoters are bad, there's gonna be shuttling on those promoters, and you're gonna get leaky transcription. And so, and this is an assertion at this stage, right? But I think you're gonna get context-inappropriate gene activation as a result of just putting a lot more NF-Kappa B into the nucleus. I showed you right at the beginning that there are also risk variants that localize close to NF-Kappa B binding sites in the genome. So at promoters where those variants exist, you're gonna get differential activation of those promoters in a way that's probably unrelated to the total amount of NF-Kappa B, but that's an additional modulation. And so, here's what you can do. You can take a bunch of cells, take people who are risk variant homozygotes and people who are non-risk homozygotes, so people who have different amounts of NF-Kappa B in the nucleus, and hit them with TNF. In 15 minutes you get signaling, so you measure how much phosphorylation you're getting. In 30 minutes you get translocation, so you can actually do ChIP-seq and see where NF-Kappa B is going in the nucleus.
Within two hours you get gene activation in CD4 cells, so you can do enhancer mapping and RNA-seq and see what's changing in the regulation between these two groups. And within three days you can get cell phenotype by producing the full activation stimulus, and you can measure that by flow and actually see what these cells are doing. So these are the level of experiments that you really need to do, and this is what we're doing right now to actually see what the differential risk states are in various T cells. And I'm way over time, so I'm gonna stop and I will just acknowledge a bunch of my colleagues at the IMSGC, the International MS Genetics Consortium, a lot of partners including Brad and John, where we do a lot of these genomics things, and people in my lab. Most of the causal mapping that I showed you is from Parissa, a postdoc in my lab. All of the immunology is from Will Housley, who's a fellow with David Hafler and with me, and I will stop and take a couple of questions if I'm allowed. Thank you very much.
Great work. So a quick question: maybe you've seen that Peter Scacheri described MEVs a couple of years ago, so multiple enhancer variants, where there's multiple DHSs or whatever measurement. And there's a wide range there. You can have MEVs with only two DHSs, with three, with four, with five. So I'm just wondering how you take that into account in your pipeline. Are the genes that you find at the top of your list biased towards risk loci that have a high MEV nature as opposed to those that have a lower MEV nature? And the second thing, there's a large collection of risk loci that actually are singletons, that will only have like one DHS site in them. How are you taking those into account?
Right, so this is not about a single peak. This is about whether the peak is consistent across cell types. Right, if there is only one peak in the entire collection of samples and you don't see it in the replicate, we throw it out.
It's a singleton in one sample, so if I have two CD4s and I only see a peak in one CD4 and never in any other cell type, I'm throwing that one out.
So that's gonna be true for one DHS site. Right. But an MEV would have two of those, or three of those.
So what we're not doing yet is the combinatorics of the clusters. We're basically not thinking about MEVs, but the correlation should still be there. It should be multiple of them correlating. So that gets naturally taken into the correlation towards the gene, because all three of those DHSs should be correlated.
Right, but they might not go to the same genes. So you might have one risk locus with multiple genes regulated differently by different subsets of its DHSs.
But if they're regulating different genes, then the risk that they're imparting should only go to the gene that they regulate.
Yes.
Does that make sense?
Yes.
Because you're trying to figure out which gene is being altered by whatever the risk effect is. And so if DHS one is correlated to gene three, I don't care about transmitting its risk quotient to gene two, because there's no evidence that it's controlling it.
Yeah.