The next speaker is Chris Cotsapas from Yale, and also generally at the Broad, who's going to talk about new work on mapping regulatory regions and genes in disease.
Thank you, John. And hello, everyone. I'd like to also thank the organizers for inviting me here. This is always a great meeting. I came last year to the DC one and I'm really happy to be back and tell you a little bit more about what we've been doing. So the work that I will tell you about today is led by a postdoc in my group, Parisa Shooshtari, who's just finished it. It is now up on bioRxiv. It's currently in review. And if you do like what I have to say, we're looking for new and interesting people to come and think about new and interesting stuff with us all the time. So that's a plug. So come and work for not very much money, like everywhere else in academia.
So we're interested in the autoimmune and inflammatory diseases. There's a huge range of phenotypes that basically are driven by a loss of tolerance to self, and it's either tissue specific or systemic. This goes everywhere from getting holes in your brain in MS, to getting holes in your gut in Crohn's and ulcerative colitis, to pancreatic attack, to a general dislike of anything that has DNA in it, which we call lupus. And yeah, these are pretty bad diseases. They're fairly common, so about 8% of Americans have them. They're disproportionately prevalent in women, so about 80% of patients overall are women. And our answer is basically to remove people's immune systems, which is kind of not a great answer. So there's a lot to do.
Over the last 10 years, we've done a whole bunch of GWASs. They've been hugely successful in these diseases. The latest one, for instance, in MS that I'm heavily involved in, we now have 200 loci with common variants that mediate risk. We explain, you know, mumble, mumble, mumble, 60% of the heritability. And we have no idea what's going on. So that's great. That's fantastic.
So over the last five years, I won't belabor this, but it's become really obvious that the majority of these effects localize to regulatory regions. They're mostly tissue specific. John, as part of ENCODE and the Roadmap Project, showed this really, really elegantly in 2012. A whole bunch of analyses have since shown this in different ways. So it doesn't really matter if you take the top SNPs and ask if they're enriched in regions, if you do a much more formal heritability analysis, if you look at histone modifications versus DHS, the signal is pretty much there. So it's kind of believable, which is great because we have no idea how to look at these things. That's fantastic.
So the biggest problem that I think of as a geneticist is, how do I persuade myself that going from a general genome-wide view like this to enhancer 23 on chromosome one is going to give me pathology? Because it's really tempting to take an association peak, look under it, or take the most associated SNP and say, hey, look, this falls into a motif. But we know that in common variant studies, the most associated SNP is not by definition pathogenic. In fact, it's unlikely to be. And you can prove this to yourself in a number of ways, including if you simulate a bunch of data and tell the simulation what the causal variant is, most of the time you wind up with it not being the most associated SNP. And that's just a thing. That's a limitation of genetics.
So we set ourselves a problem: OK, can we come up with a more nuanced way to identify specific regulatory regions that we think are mediating risk? And can we apply it to the autoimmune diseases that we're interested in, although the method would be general? And so that's what we did, allegedly. This is all allegation at this stage.
So here's an association peak. It could be any peak, but the idea basically is every dot's a SNP. This is a log p-value of association. There's clearly an association to risk somewhere in here.
And if I can start from summary p-values that are available, I might be able to derive a posterior probability of association, which we kind of know how to do now. So these are often referred to as credible interval sets. So you identify the set of variants that is 90% likely to contain the causal variant, if you've actually genotyped the causal variant or at least imputed it. And we know that this works reasonably well. And for this, it's pretty important to have genotyped or imputed the causal variant.
And so our first pass has been on something called the Immunochip. Rather than a GWAS chip, where you genotype a lot of variants across the whole genome but not every variant, here we look at about 200 regions of the genome we know are associated to autoimmune disease, and we genotype everything in those. And so the Immunochip Consortium, which I'm part of, kind of came together and designed this array about five years ago. And the important thing is that we genotyped about 200,000 samples as part of many, many international consortia across many of these diseases, including controls and cases. And so there are large cohorts genotyped on exactly the same platform, and very, very densely genotyped in these 200 loci. And that'll become important a little later. So that's a marker that I will pick up in a little bit.
But we can go through these regions because they're well-powered and because they're dense. We can estimate credible intervals. And through some kind of magic, we can overlay these SNPs with regulatory region information. In our case, we're using DHS peaks. And the idea is, for every DHS peak that contains SNPs, I can calculate the posterior of association, okay, purely by this overlay. So I'm color coding everything from gray to red. Red is interesting. Gray is boring. You've seen this before. A lot of red SNPs are on this DHS. And I can formally calculate the posterior probability that this DHS mediates risk.
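As a rough illustration of the overlay described above, here is a minimal Python sketch of going from per-SNP summary statistics to a 90% credible set and then to a per-DHS posterior. It is not the model from the talk. It assumes exactly one causal variant per locus, a flat prior across SNPs, and a Bayes factor proportional to exp(z²/2); the function names, positions, and z-scores are all made up for the example.

```python
import math

def snp_posteriors(z_scores):
    """Posterior that each SNP is the causal one.

    Assumptions (not from the talk): a single causal SNP per locus,
    a flat prior, and a Bayes factor proportional to exp(z^2 / 2).
    """
    log_bf = [z * z / 2.0 for z in z_scores]
    biggest = max(log_bf)  # subtract the max for numerical stability
    weights = [math.exp(v - biggest) for v in log_bf]
    total = sum(weights)
    return [w / total for w in weights]

def credible_set(posteriors, coverage=0.9):
    """Smallest set of SNP indices whose posteriors sum to >= coverage."""
    order = sorted(range(len(posteriors)),
                   key=lambda i: posteriors[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += posteriors[i]
        if mass >= coverage:
            break
    return chosen

def dhs_posterior(positions, posteriors, start, end):
    """Posterior mass on a DHS: sum over the SNPs that fall inside it."""
    return sum(p for pos, p in zip(positions, posteriors)
               if start <= pos <= end)

# Toy locus: five SNPs, two of them strongly associated (hypothetical data).
positions = [100, 150, 220, 400, 900]
z_scores = [1.0, 5.0, 4.8, 2.0, 0.5]

post = snp_posteriors(z_scores)
cs = credible_set(post)                      # indices of SNPs in the 90% set
dhs = dhs_posterior(positions, post, 140, 260)
print(cs, round(dhs, 4))
```

In this toy locus the two strongly associated SNPs make up the credible set, and because both sit inside the hypothetical DHS at 140 to 260, nearly all of the posterior lands on that DHS.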
This is not just an association statistic, but the posterior. So how likely is this DHS to explain the signal here, okay? This DHS might be correlated to a set of genes. And this is where it gets tricky. So you want ultimately not just to identify a region, but you want to know what gene it's affecting. And so if I could figure out what genes are controlled by the factors that bind in such a regulatory region, I could then use that relationship to estimate a posterior probability on this gene of being pathogenic. So, you know, math is nice. We like math. If this doesn't make any sense, this doesn't work. But if I can make this go, then I should be able to score any regulatory region and any gene for how likely it is to drive risk, okay? End of math lecture. Now let's look at some stuff.
So all of this stuff works, but the key thing is to build this relationship here. And for that, you actually have to do something like this. There are many ways to think about this, but the way we've chosen to do this is, let's correlate the accessibility state, so whether the site is open, to the expression of nearby genes. And if there's a relationship between whatever it is that's happening in this site and the expression of this gene, if this is a control thing, then every time the site's accessible across tissues, this gene should be expressed at a different level than when the site is not accessible. And here's a toy example where these guys are clearly not correlated to the state here and this guy is. So maybe this DHS is actually controlling this gene in a way that I don't know. So I'm not making statements about molecular biology. I'm just saying this is correlative. And then I can use that to create a transitive relationship between the risk on this DHS and this gene.
So this, it turns out, is a little bit tricky to do. And it's tricky because you have to make sure that the DHS you're looking at across tissues is the same thing. And so we solve this in this way.
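To make the correlation step concrete, the sketch below correlates a binary accessibility state with expression across tissues, then passes the DHS posterior on to genes. The talk does not say how the transitive step is actually computed, so the weighting here (splitting each DHS posterior across genes in proportion to squared correlation) is purely an assumption for illustration, as are the gene names and numbers.

```python
import math

def accessibility_expression_r(accessible, expression):
    """Pearson correlation between a 0/1 accessibility state and expression
    across tissues (equivalent to a point-biserial correlation)."""
    n = len(accessible)
    mean_a = sum(accessible) / n
    mean_e = sum(expression) / n
    cov = sum((a - mean_a) * (e - mean_e)
              for a, e in zip(accessible, expression))
    var_a = sum((a - mean_a) ** 2 for a in accessible)
    var_e = sum((e - mean_e) ** 2 for e in expression)
    if var_a == 0 or var_e == 0:
        return 0.0  # no variation, so no usable correlation
    return cov / math.sqrt(var_a * var_e)

def gene_scores(dhs_posteriors, correlations):
    """Pass each DHS posterior on to genes.

    Assumption (not from the talk): a DHS splits its posterior across
    genes in proportion to the squared correlation with each gene.
    """
    scores = {}
    for dhs_id, posterior in dhs_posteriors.items():
        weights = {g: r * r for g, r in correlations[dhs_id].items()}
        total = sum(weights.values())
        if total == 0:
            continue
        for gene, w in weights.items():
            scores[gene] = scores.get(gene, 0.0) + posterior * w / total
    return scores

# Toy data: one DHS across six tissues, two hypothetical nearby genes.
accessible = [1, 1, 1, 0, 0, 0]
expr_gene_a = [8, 9, 10, 2, 3, 1]   # tracks accessibility
expr_gene_b = [5, 4, 6, 5, 6, 4]    # does not

r_a = accessibility_expression_r(accessible, expr_gene_a)
r_b = accessibility_expression_r(accessible, expr_gene_b)
scores = gene_scores({"dhs_1": 0.8},
                     {"dhs_1": {"GENE_A": r_a, "GENE_B": r_b}})
print(round(r_a, 3), r_b, scores)
```

Here GENE_A inherits essentially all of the 0.8 posterior on the DHS and GENE_B gets none, which mirrors the toy picture of one correlated gene among several uncorrelated ones.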
We take all of the Roadmap and ENCODE tissues, anything that's got a replicate in DHS space. We call peaks de novo just because we want everything to be on the same footing and just make sure everything's okay. We wind up with a set of peaks. We then cluster these peaks across tissues. And we cluster them really simply. So for any event where you've seen a peak here in sample one and a peak in sample two, you calculate the overlap divided by the union as a fraction of how similar they are, okay? So it's a similarity coefficient, a Jaccard distance, whatever you want to call it. But then you use those relationships to cluster across tissues. It doesn't really matter what this does. It's just a method of clustering that does that. We wind up with a set of clusters, almost two million strong.
But then the interesting thing is that we wind up doing a replication test. And this is why we need replicates. So I've just clustered naively to what these tissues are, whether they're the same or different and so on. And then I just say, okay, well, if I have a cluster and my clustering algorithm has decided that these guys are all the same, if it's biologically plausible, then I should see that it's accessible in both replicates of a tissue or not accessible in both replicates, right? Straight up replicate reproducibility. And you see that that's a bad cluster in this cartoon and that's a good cluster. So it's always either accessible or not accessible. And so we can just apply a really simple filter. It turns out that you can write down basically a chi-square, one degree of freedom test that explains that, and anything that beats a really weak nominal filter passes. About 54% of the clusters that we define pass this level. So we wind up with about a million clusters. We think that part of it is because the clustering algorithm maybe isn't quite as sensitive as we need it to be, and part of it is because maybe some of the peaks don't replicate.
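The two ingredients just described, the overlap-over-union similarity between peaks and a one degree of freedom replicate test, can be sketched in Python as follows. The exact test statistic used in the work is not given in the talk; this sketch assumes a standard Pearson chi-square on the 2×2 table of replicate 1 versus replicate 2 accessibility across tissues, plus a direction check so that only concordant clusters pass. The threshold and data are illustrative.

```python
import math

def jaccard(peak_a, peak_b):
    """Overlap divided by union for two (start, end) intervals."""
    (s1, e1), (s2, e2) = peak_a, peak_b
    inter = max(0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

def replicate_chi2(rep1, rep2):
    """Pearson chi-square (1 df) on the 2x2 table of accessibility calls
    in replicate 1 versus replicate 2, one entry per tissue.
    Returns (chi2, p_value, concordant)."""
    n11 = sum(1 for a, b in zip(rep1, rep2) if a and b)
    n10 = sum(1 for a, b in zip(rep1, rep2) if a and not b)
    n01 = sum(1 for a, b in zip(rep1, rep2) if not a and b)
    n00 = sum(1 for a, b in zip(rep1, rep2) if not a and not b)
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0, 1.0, False  # no variation across tissues, untestable
    chi2 = n * (n11 * n00 - n10 * n01) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 df
    return chi2, p_value, n11 * n00 > n10 * n01

def passes_filter(rep1, rep2, alpha=0.05):
    """Weak nominal filter (alpha is an assumed threshold)."""
    _, p_value, concordant = replicate_chi2(rep1, rep2)
    return concordant and p_value < alpha

# Two peaks that share half of their length.
similarity = jaccard((100, 300), (200, 400))

# A "good" cluster: replicates always agree across eight tissues.
good = replicate_chi2([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0])
# A "bad" cluster: replicates agree no more often than chance.
bad = replicate_chi2([1, 0, 1, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 1, 0, 0])
print(similarity, good, bad)
```

The direction check matters because a chi-square statistic is also large when replicates systematically disagree, which is not the reproducibility being asked for here.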
It doesn't really matter why. We wind up covering 8% of the genome instead of 14% of the genome, and then you get stuck and you say, okay, well, why is this not just crappy analysis? And so we went back to GWAS and thought, okay, I know that the DHS peaks are enriched for GWAS signal. If this subset of clusters that I'm picking up is actually biologically plausible, they should also be enriched for that GWAS signal. And so going through the latest MS GWAS, we just did a heritability analysis of our clusters in blue versus all of the DHS peaks in red. And basically we recapitulate all of the heritability. I won't go through all of this, but this appears to work. So we capture most of the meaningful heritability and information in the subset of clusters. So our analysis goes.
So basically we can wind up doing this thing here. So here's what happens when you run this model. You have a region of the genome. I'm showing you two megabases now. Here's a densely mapped locus on the Immunochip that I've shown you before. This is for inflammatory bowel disease risk. That's a 10 to the minus 30 p-value. So this is a really strong association. These are all of the genes in the region. It winds up that when you run this model, you wind up with a single DHS. We don't really have a great coordinate system, so we just label them arbitrarily in terms of numbers, but it's just a 250 base pair region that's DHS accessible. And it explains a lot of the signal. If you then do the correlation, this guy is correlated to exactly one gene, ETS2. And if you look at how strong the correlation is, this is every gene in the region that kind of survives the analysis. If you look at these little plots here, they're kind of like modified dot plots. Blue is the expression level of this gene when the DHS is not accessible and orange is when it's accessible. And you can see here that there's no real correlation between accessibility and expression level.
And if you look at ETS2, there's quite a nice correlation between those two states. Okay, and that's what drives this relationship. And so ETS2 looks like a sort of compelling candidate here, not just because the most associated SNP lands in a DHS and I've decided that this DHS is interesting, but because I can put a posterior probability on this and actually formally declare how interesting this is.
You run this on other sites, I'll just trot through these. This is a really interesting locus for multiple sclerosis. It's the CD58 locus on chromosome one. We are reasonably sure through a variety of strands of evidence that CD58 is relevant here. And so when you do the same thing, you get two DHSs. The picture's a little bit more complicated, but clearly the strongest gene that comes out is CD58. The shading can be hard to interpret, but it kind of helps a little visually. What's really interesting is that CD2 is CD58's binding partner. CD2 doesn't really have any strong evidence. So I may even be unsure which one of these two DHSs is actually driving risk, because this one has a little bit of evidence, but the gene is fairly clear, because they're both correlating to it. And that's useful, that's useful to know. And if you look at the expression levels, again sort of CD58 versus CD2, CD2 is clearly differentially expressed in different cells, but it's not really got very much to do with these two DHSs. Again, I won't belabor this too much, but you can go through many, many regions doing this. This is the IRF8 locus in RA, sort of same deal.
And so when you run this over nine diseases across 200 loci with many, many DHSs, here's a quick summary of how this winds up going. These are now different diseases down here. This is the number of DHS clusters that are in a locus, and this is the number of genes that are potentially in a locus. So for primary biliary cirrhosis, sorry, primary biliary cholangitis. Gee, try saying that.
You start with a thousand odd DHS clusters in the region. By the time you run the model, you wind up with something like three per locus. This is a log scale. Just doing this narrows down, across all diseases, the number of DHSs you need to consider by a huge factor. So even if you then want to be agnostic and look at all of them experimentally, this is now feasible in a way that it wasn't before. In the same way, if you're thinking about possible candidate genes, if you run the model, you go from 20 odd genes to three per locus across 200 loci per disease. And that kind of replicates across diseases. So the fine mapping actually works. You prioritize genes fairly cleanly, even if you can't identify a single gene, which we can in about half of the cases.
If you now look at how much of the risk in each locus is explained by DHS: identifying DHS doesn't matter if they actually don't explain a lot of the risk. There will always be a best scoring DHS. It doesn't mean the best score is a good score. Okay? So we find loci where clearly all of the risk is on DHS. The median is something like 20 to 25%, which is on something like 0.02% of the space on the locus. So that's an enrichment. But about half of the loci have more of the risk explained by DHS than not by DHS. So very clean regulatory potential. And it kind of depends on which of these you like to do. And similarly, in the posterior per gene, you find that a lot of genes score very cleanly and very strongly, suggesting that they are the top pathogenic candidate.
What's really interesting is if you take the DHS that we prioritize and you cross them with an analysis that John and Matt Maurano just published on whether or not variants in those DHS cause differential accessibility, you find that the DHS that we find are much more affected by SNPs in terms of accessibility than not. So it looks as though this is also a functional effect.
And as you'd expect, all of these things are predominantly present in the immune cells in the collection that we've looked at and not in the others. And those are just the red. So I'm just skipping over this because it's a figure in the paper.
I will leave you with one little thought. Here's a locus again that I've shown you before. It's the BACH2 locus that is of substantial interest in a number of diseases. This is the MS data. When you run the model, you find a single DHS that prioritizes the gene MDN1, even though all of the association is in the last intron of BACH2. The correlation is nice and clean with MDN1. There's some correlation with BACH2, but there are clearly samples that do not behave in this way. I don't expect you to read this, but I will leave you with this idea that there's association that is very strong across five diseases to this locus. We run the model independently on all five. In four of the five, we identify MDN1 as very clearly the prioritized gene. In IBD, it looks different. It also looks as though the most associated SNP in IBD is dramatically different to the others, and that prioritizes MAP3K7.
And if you look across all diseases for loci that have association to more than one disease, you find that although the most associated SNPs are generally not the same in any pairwise comparison, when you prioritize genes, you actually start seeing a much more frequent overlap. And the difference between these two is actually quite significant. It's got a sort of 10 to the minus three p-value. And so we think that the association data is actually metastable, if you like that idea. So the individual SNP that you identify as the most associated is at the mercy of vagaries of sample size and the specific sample of the population that you see, and we understand this. We understand that's how genetics works.
When you start prioritizing individual functional elements and genes, you start seeing a more stable pattern coming through. And so when you look at the shared association, and we know that these diseases are comorbid, when you start looking at shared associations across diseases, you tend to identify the same genes. And so by piling this up, I think we can actually increase our confidence that we're identifying genes that are relevant to disease. And that's important not just because it's cute, but because I think you will realize how time consuming and costly it is to follow up a lot of these genes. Chasing down rabbit holes because you looked at the wrong SNP is a bad idea, but maybe chasing down the right genes that have a lot of evidence across diseases is a good place to start. And I will leave it there and I'm happy to take any questions. Thank you.
With respect to the SNP, thank you. That they were cis or trans or not close by or close by.
Good question. That's a great question. I promise I did not plant that question, but thank you for asking it. Yes, we do. It's generally not the closest gene. In the vast majority of cases, it's not the closest gene, which I think is interesting. Because we're not looking at expression, trans is really hard to say. This is a locus specific analysis. So we didn't look for DHS in one locus across another one, in part because there are still few enough tissues that you get DHS across the genome that are all correlated. And so you will see a correlation in trans that I think is maybe limited by just the pure number of tissues rather than a genuine correlation. So it would be very difficult to persuade yourself that the trans association versus a cis one would be meaningful.
So with regards to the spectrum of the number of genes you find and/or the number of DHS that you find associated with a risk locus.
I'd like to have an idea of what you're seeing, because in the case examples, as you show, it's always one DHS that's dominant, one gene that's dominant. And I rarely conceive of a risk locus as being a one DHS, one gene type of behavior. So I'm wondering whether that's the case or not.
So the median distribution, I skipped over it really quickly because that's just a summary figure in the paper. The mean number of DHS that we prioritize with more than 10% of the posterior on each one is three per locus. And that's pretty clean across diseases. There are loci where it's substantially more, and in those either something is going on that we don't really understand or the model's wrong or both. But in general, we can get down to around three or four, a handful per locus, certainly. In 60 to 65% of cases, yes. In the others, we can't really prioritize a single gene and there's usually two genes or three genes that have equal evidence and the rest of the genes don't. And so if those...