Why don't we get underway? I'm Dan Kastner. I'm the scientific director of the National Human Genome Research Institute. And it is my enormous pleasure to have the opportunity to kick things off this afternoon for the 15th installment of the Jeffrey M. Trent Lecture in Cancer Genomics. This is really a lecture series of rock stars in honor of a rock star. And of course, the rock star that this lecture series honors is Jeff Trent, who was the inaugural scientific director of what was at the time the National Center for Human Genome Research, later to become NHGRI. Jeff arrived on campus back in 1993 with Francis. Before coming here, he had been the director of basic science at the University of Michigan Comprehensive Cancer Center, and he came with Francis with the mission of establishing a new intramural research program on campus. Through his vision, through his energy, and by dint of his hard work, he quickly created an engine that transformed the intramural program of the NIH. I was here at the time, and I remember how it infused genetics and genomics into the culture of the NIH intramural program. Under Jeff's auspices, it became a powerhouse of genetics and genomics, not only in the United States but around the world. During his time here, Jeff was chief of the Cancer Genetics Branch and made seminal contributions to cancer genetics and genomics, particularly with regard to transcriptomics and the study of cancers. Jeff graduated, so to speak, back in 2002 and left to become the founder of TGen, the Translational Genomics Research Institute in Phoenix, Arizona. He has been prodigious there as well, and we were fortunate enough to have him at our silver anniversary celebration last fall. He certainly has not been resting on his laurels and in fact has exciting new initiatives that he has set up between TGen and City of Hope. In any case, in 2003, Eric, who became the second scientific director of NHGRI, very wisely thought to create a lecture series in Jeff Trent's honor. And this lecture series has been a pantheon of leaders in the field of genetics and genomics, beginning with Janet Rowley in 2003 and including Nobel Prize winners and other luminaries; it's been a really amazing group of people. So at this point, I'm going to step back and turn things over to my colleague, friend, and boss, Eric Green, who will introduce this year's Trent lecturer. Eric?

Thank you, Dan. Well, picking up on what Dan just mentioned, since the start of the Trent Lectureship, NHGRI has brought an amazing group of researchers to NIH each year to give this annual lecture, including multiple Nobel laureates and others who have truly made transformative contributions to science. Just a word of warning: don't let his youthful appearance fool you. Today's speaker, Dr. Jay Shendure, rightfully fits within that group of legendary researchers. Now, Jay is a man of many titles. He's professor of genome sciences at the University of Washington, investigator of the Howard Hughes Medical Institute, director of the Allen Discovery Center for Cell Lineage Tracing, and director of the Brotman Baty Institute for Precision Medicine. But none of those impressive titles should come as a surprise to you if you have followed Jay's meteoric career as I have for the last 12 years.
Jay earned his AB degree from Princeton University, and then an MD and a PhD from Harvard Medical School. He then got out of the gates very quickly, established an independent research program at the University of Washington, and has never looked back since. Jay's research focuses on the development and application of genomic technologies for understanding genome function and biology. He's really been center stage in efforts to make genome sequencing faster, cheaper, and more readily applicable to many different areas of biomedical research. For example, his group was pivotal in the earliest pioneering efforts that reduced to practice the ability to efficiently sequence the protein-coding portion of the human genome, otherwise known as exome sequencing, and then to apply that approach to identify disease-causing gene mutations. Jay's technical innovations did not stop there; his lab has amassed a remarkable track record of applying their methods to compelling scientific problems, and his cross-disciplinary group continues to develop new technologies. I am sure these are the kinds of things he's going to describe in his talk today. I truly regard Jay as one of the most highly accomplished genomics researchers of this decade, full stop. But I'm not the only one with that opinion; award committees have noticed the same thing. A sampling of his recent awards includes the 2012 Curt Stern Award from the American Society of Human Genetics, the 2013 NIH Director's Pioneer Award, the 2014 HudsonAlpha Life Sciences Prize, the 2014 Scripps Genomic Medicine Award, and the 2018 Richard and Carol Hertzberg Prize for Technology Innovation. He was named to Cell's '40 Under 40' in 2014. Oh, and by the way, earlier this year the U.S. National Academy of Sciences, of which Jay is a member, gave Jay the Richard Lounsbery Award for his pioneering work in what has been called the second wave of genomics. But besides all of his accomplishments, what I really want to point out is how incredibly generous he is with the time he gives to the broader scientific community, and in particular to NIH. Shortly after I became an institute director, I asked, and he accepted, an appointment on NHGRI's advisory council. He hadn't even finished his full term on that when the NIH director grabbed him and asked him to serve on his advisory committee to the director, which he still does. In fact, Jay has just spent the last day and a half at the advisory committee to the director meeting. Meanwhile, you would think the NIH director would be incredibly exhausted and overwhelmed, having just led a day and a half of his advisory committee, but Jay was attractive enough as a scientific opportunity that the director is here immediately afterward to hear Jay's talk. And as I have done multiple times, the NIH director, Francis Collins, has called on Jay in many ways as a member of his advisory committee to the director, including having him serve on the working group for the precision medicine initiative that led to the launch of the All of Us Research Program. So the service that he constantly gives to the NIH has been appreciated, both by the peers who serve with him and by the NIH itself, and I know I speak on behalf of Francis Collins in being very grateful for his wise counsel.
So I'm delighted to turn the podium over to an outstanding genomics researcher, a highly productive NHGRI and NIH grantee, a generous contributor to NHGRI and NIH, and someone whom I personally regard as a good friend, a colleague, and a trusted advisor: Dr. Jay Shendure.

Thank you, Eric, for that extremely kind introduction. It's a real honor and a privilege to be here, and for a lecture at this institute in particular, to which I feel deeply connected at many levels. Putting aside even the fact that you fund me, the personal connections and the associations I've had, on council and in other ways, have really tied me to this group, and this is really my community. So it's great to be here and to have a chance to tell you about some of our recent work. Broadly speaking, and it's nice that with this audience I don't have to apologize for this or try to mitigate it in any way, our lab is a technology development lab, and a common thread that has run through our work the whole way is this idea: can we multiplex biology, at every level and in as many flavors as we can? If you think about all of these technologies, like next-gen sequencing, exome sequencing, massively parallel reporter assays, the common thread of this and other work is really the performance of multiple experiments within a single volume, so to speak. That's a theme here, and it's a very technical theme, as opposed to a disease or a particular physiologic mechanism, but I think it's still a powerful one, and it still has a lot of runway. Okay, so I'm going to talk about three very different projects. All of this work is funded by the NIH, and a fair bit of it by NHGRI in particular. They're very different projects, trying to answer old questions with new methods, but the common theme is this idea of multiplexing biology. So, to dive into the first one: this is work from an MD-PhD student, Greg Findlay, and a research professor at UW, Lea Starita, in my group. Just to frame the problem: BRCA1, I'm sure the vast majority of you are familiar with it as a poster child for genomic medicine. This is a gene for which we know that having a pathogenic mutation will lead to a markedly increased risk of early-onset breast and ovarian cancer. BRCA1 and 2 have probably been sequenced more than any other genes: first mapped in 1990 by Mary-Claire King, subsequently cloned, and the subject of a great deal of both scientific and legal attention. And one of the distinguishing features here, which is probably only true for about 50 or so genes right now, is that if you know that an individual has a pathogenic mutation, it is an actionable piece of information: there is something you can do that will make a difference in outcomes for that patient. The challenge for this and those other genes, and what will undoubtedly be an increasing number of genes, is that even though we have implicated the gene, we do not always know which variants are pathogenic. Simplifying things somewhat, one might regard synonymous mutations as benign and nonsense mutations as pathogenic, but the stuff in between, both missense and regulatory, is challenging to interpret. So there are different approaches for solving this.
One of the classic approaches is to look for other patients who have that particular variant and see what happens to them, so that you can make interpretations about future patients with the same variant. The challenge is that most specific variants are exceedingly rare, so even in cohorts exceeding 100,000 individuals, you might only see a particular variant once. So that paradigm, which is represented up here, data sharing in genetics, works, but its throughput is low, and we might well have to sequence everyone on the planet before we can extend it to every variant. Computational prediction is another approach. It's very scalable: you can make a prediction for anything. The challenge is validity. We and others have developed methods, our particular method is called CADD, which I think works better than average, but when you're talking about patients, that's obviously not good enough. And so this is the challenge. The third category here, functional assays, is by and large considered a valid approach: experimentally characterizing an individual variant. But historically this has been done in an ad hoc, piecemeal manner, often very retrospectively, where it may or may not be of use to an individual patient, rather than in a systematic way. Okay, so over the last few years, we and others have been trying to develop methods for multiplexing these sorts of functional approaches, in the context of regulatory elements but also in the context of variants that impact genes. The basic broad paradigm is as follows. One uses any number of synthetic methods to build a library of variants of a particular sequence of interest. One then introduces those into some biological system, let's say a population of cells, and then performs a phenotypic assay. And finally, using perhaps sequencing, perhaps something else, one can quantify the relative effect sizes of individual mutations in this population. And a key point here, or two key points: one is that this is one experimental workflow, so I'm characterizing many mutations, but I'm doing it as part of one experiment. And the follow-on from that, a powerful aspect of this relative to approaches where you characterize one mutation at a time, is that you get a distribution of effect sizes. That's a powerful thing to have when you're trying to interpret any one mutation: you can see where it sits relative to all other possible mutations. Okay, so classically, people have been doing functional assays for decades, typically on episomal reporter vectors: you put in the cDNA, or you've got your regulatory element, you make variants, and you look at them. The frequent criticism of this approach is that we don't know how different chromatin on episomes is relative to chromatin in the native genome. Also, if you're looking at a cDNA, as opposed to a gene in its native exon-intron structure, you don't know whether you're missing some key parts of native regulation. So this is not the genome, but nonetheless, it's what we could do. With the development of CRISPR, we no longer have quite the same excuse: one can now make changes directly to sequences of interest in their native context.
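To make the quantification step of that broad paradigm concrete, here is a minimal sketch, assuming per-variant read counts from sequencing before and after selection; the variant names and counts are invented for illustration, and the actual published pipelines are more sophisticated.

```python
import math

def function_scores(pre_counts, post_counts, pseudocount=1):
    """Log2 enrichment/depletion of each variant between two sequencing samples."""
    pre_total = sum(pre_counts.values())
    post_total = sum(post_counts.values())
    scores = {}
    for var, pre in pre_counts.items():
        post = post_counts.get(var, 0)
        pre_freq = (pre + pseudocount) / pre_total
        post_freq = (post + pseudocount) / post_total
        scores[var] = math.log2(post_freq / pre_freq)
    return scores

# Hypothetical variants: a depleted (function-compromising) variant gets a
# strongly negative score; tolerated variants stay near zero.
pre = {"c.53T>C": 120, "c.55G>T": 95, "c.57A>G": 110}
post = {"c.53T>C": 130, "c.55G>T": 4, "c.57A>G": 118}
print(function_scores(pre, post))
```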
This topic, I hope, doesn't need much introduction by now, but just as a brief refresher on the ways in which one can use CRISPR. CRISPR is generally used as one approach to introduce a double-stranded break at a particular place in the genome; that's at least one way of using it. If you provide nothing to the cell to repair that double-stranded break, it will try to repair it just by joining the ends. Sometimes it will make a mistake, which typically leads to deletions or insertions, and this can be used to disrupt genes, for example by introducing a frameshift. The other way in which it can be used, or one other way, is to exploit homology-directed repair, where you provide a repair template for the vicinity of where you are cutting. The cell uses that repair template, and you introduce a precise edit, a precise variant. We're mostly going to be talking about this avenue here. So saturation genome editing, which is a method that Greg, together with Evan Boyle, then an undergraduate and now a graduate student in Jonathan Pritchard's lab, came up with a few years ago, is an approach where we try to multiplex the process of genome editing with homology-directed repair. In a nutshell, we have a population of cells, and we are introducing a cut at one particular location in the genomes of those cells; there's one place we're cutting. But rather than providing one repair template, we're providing a library of repair templates. So this cut can get repaired in any number of ways. It might not get repaired precisely at all, in which case you have NHEJ and an insertion or deletion, but you might instead introduce one of these variants that you programmed on a microarray. And you end up with a population of cells that each have a different edit. So we developed that method, and it took us a few years to think about how we should actually use it. And Greg had this, I think, remarkably good insight from keen reading of the literature. There was a paper from Thijn Brummelkamp's group, one of the first CRISPR screens, which was done in a cell line called HAP1. This is a cell line that was derived from a cancer cell line but has the particular property of being nearly completely haploid: there's only one copy of every chromosome. And what Greg noticed in the data from that paper was that it appeared, according to their results, that BRCA1 was essential, meaning that if you cut it and introduce a frameshift, the cells die. And we confirmed this in our own lab. We cut the cells at one locus, HPRT1: the cells are fine. We cut at BRCA1: they die. And then, as a corollary to this, one important question here is: why are they dying? If you're thinking about these functional assays and trying to relate them to patients, as we ultimately are going to be doing, does the function that is resulting in cell death in this system have anything to do with why a patient with a BRCA1 mutation develops breast cancer? That's a very good question to ask. Two things suggest it might. One is that these cells are ES-like, embryonic stem cell-like, and mouse ES cells are often used as a model for studying BRCA1 variants in these kinds of functional assays. The second is that if you look at other genes that are essential in this cell line, the homology-directed repair genes are on that list, or many of them are, suggesting that that pathway is actually important for these cells' survival.
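As a toy illustration of the two repair outcomes just described, here is a sketch of how one might classify alleles recovered by sequencing an edited locus; the inputs and categories are simplified assumptions, not the actual analysis from this work.

```python
def classify_edit(ref_len, allele_len, matches_programmed_variant):
    """Coarse classification of one sequenced allele at the cut site."""
    if matches_programmed_variant:
        return "HDR: programmed variant installed"
    indel = allele_len - ref_len
    if indel == 0:
        return "unedited (or substitution only)"
    # NHEJ typically leaves an insertion or deletion; a length change that is
    # not a multiple of 3 shifts the reading frame and disrupts the gene.
    return "NHEJ frameshift" if indel % 3 else "NHEJ in-frame indel"

print(classify_edit(100, 98, False))   # 2 bp deletion -> frameshift
print(classify_edit(100, 100, True))   # precise HDR edit
```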
And I should have said this already, but I'll say it now, because it can be a little bit confusing: I'm using the phrase HDR, or homology-directed repair, in two different ways. On one hand, it is the mechanism by which we are repairing CRISPR cuts. In addition to that, it happens to be the physiological function that is thought to be important for BRCA1's role in breast cancer. So we have the two things together, and I'm using the term in both ways; just to clarify that. So the fact that other HDR genes are essential in this cell line gives us some confidence that maybe we are capturing the physiologically relevant function. Now, BRCA1 is an enormous gene, stretching over 81 kilobases, and just the ORF is about 5,500 bases. Within the space of the PhD component of an MD-PhD, we didn't necessarily want to do the whole thing, so instead we set about picking the particular parts of the gene that we thought would be most fruitful. And these two domains, the RING domain and the BRCT domain, are both, one, thought to be critical for this HDR activity of BRCA1, and two, happen to be where all of the known pathogenic missense variants of BRCA1 lie, literally all of them. So if we're going to focus our effort and time somewhere as a proof of concept, it made sense to do so on these 13 exons. Okay, so the actual experiments that Greg and colleagues did were as follows. We proceeded one exon at a time, made a cut in these HAP1 cells, and then repaired with a library of every possible single-nucleotide change of that exon. So we're making every possible swap, not only in the exon but going about 10 bases into the intron on either side, so in total about a 100 base pair region. And in total we did this across 13 exons, which added up to around 4,000 point mutations. So, to drill into the rest of the experiment, focusing on one exon in particular: one challenge here is that even after a lot of optimization, and even engineering the cell line, this process of HDR is not very efficient in mammalian cells. It's fantastically efficient in yeast, but not so much in mammalian cells. And so the nice part about this system, in addition to the other aspects I mentioned, is that when we introduce frameshifts, which are the other 80% here, those cells just die, so we don't have that background. So we introduce these edits, and then we simply wait. The expectation is that because this gene is essential, if we're introducing a pathogenic or function-compromising mutation, those cells should die, whereas cells with benign mutations should live. Now, these are just plots showing the distribution of counts for different variants over time. And if we look at this as a ratio of enrichment or depletion over the time course, comparing the day 11 and day 5 time points, and here we're just looking at nonsense mutations and synonymous mutations, we get a nice bimodal distribution. With a handful of exceptions I'll mention again later, for the most part the synonymous mutations are not depleted, whereas the nonsense mutations are depleted. Now, if we look at the missense mutations, we get a nice bimodal separation, where, for this particular exon, about half of them group with the synonymous mutations and about half with the nonsense mutations.
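For concreteness, 'every possible single-nucleotide change' of a region can be enumerated mechanically; a sketch, assuming the target is an exon plus roughly 10 bases of flanking intron. The toy sequence below stands in for a roughly 100 bp region, which, at three alternative bases per position, gives on the order of 300 variants per exon, or about 4,000 across 13 exons, matching the numbers above.

```python
def saturation_snvs(region_seq):
    """Yield (position, ref, alt, mutated_sequence) for every possible SNV."""
    for i, ref in enumerate(region_seq):
        for alt in "ACGT":
            if alt != ref:
                yield i, ref, alt, region_seq[:i] + alt + region_seq[i + 1:]

region = "ATGGATTTATCTGCT"  # toy stand-in for ~100 bp of exon plus flanking intron
library = list(saturation_snvs(region))
print(len(library))  # 3 alternatives per position: 45 variants for this 15-mer
```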
Okay, another way of viewing the same data: this is looking along the exon, and the colors correspond to the different functional categories, and we're just looking at whether particular changes result in enrichment or depletion. A few phenomena I want to point out. One is that the nonsense mutations uniformly go down, as you can see. The synonymous and the regulatory mutations are lumped together here, but the ones in the exon, you can see, are by and large benign. But looking at the junction, we see that for this particular junction, and it tends to be pretty unpredictable from junction to junction, the gene is almost completely intolerant to changes not only at the canonical pair of splice bases right at the junction, but also at the first coding triplet, as well as several bases into the intron. This varies by junction, but this is a particularly striking one. Another cool thing: if you look here, you can see a handful of synonymous changes that result in depletion, these blue lines going down. So what's going on there? One cool thing is that we can actually look at RNA levels: we can sequence the RNA and, looking at the point mutations on those RNA molecules, ask about their levels. And if you look at these apparently non-functional synonymous changes, we see that many of them, really all of them, do in fact result in depletion of RNA levels. And if we look in detail, we see they're often creating new splice sites and probably compromising function that way. There are also missense changes which you might naively or naturally think have their effects mediated by changes to the protein, but which may in fact be acting at the RNA level. Okay, so all together we were able to hit about 96.5% of the roughly 4,000 possible point mutations in these 13 exons that we targeted. Here's the overall distribution. So, an important question to ask here: we see the patterns we expect for the most part, but is this actually valid clinically, coming back to what I said earlier? If we look at mutations that are scored in ClinVar, which is the database of clinically adjudicated variants in genes like BRCA1, we see very strong agreement between our data and those calls, with, again, a handful of exceptions. This is a little bit overly fair to ourselves in the sense that we're including nonsense and synonymous mutations here, but even if we only look at the missense mutations, we see good agreement. The handful of exceptions we've drilled into further, including by looking at how and why they were included in ClinVar, and in each case we were able to convince ourselves that there's a reason they should not have been included, and that our annotations may actually be correct. Obviously these individual edge cases need careful attention, but I think it's also important to recognize that databases like ClinVar are not perfect. Okay, so in these 13 exons we've achieved nearly perfect classification with these experimental approaches. We've got some variants that fall in an intermediate group, and you can get some interesting statistics out of this, like the proportion of missense changes, or changes overall, that are functional versus non-functional. And we were surprised by the rather substantial number of variants that appear to have their effects by disrupting expression rather than acting at the protein level. Okay, and we're hoping this is a paradigm, potentially, moving forward, if we can scale this further.
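A hedged sketch of the kind of concordance check described here, assuming function scores from the assay and ClinVar labels keyed by the same variant names; the single threshold and two-way call are illustrative stand-ins for the published classification procedure.

```python
def concordance(scores, clinvar_labels, threshold=-1.0):
    """Fraction of ClinVar-labeled variants whose assay call matches the label."""
    agree = total = 0
    for var, label in clinvar_labels.items():
        if var not in scores:
            continue
        call = "pathogenic" if scores[var] < threshold else "benign"
        total += 1
        agree += (call == label)
    return agree / total if total else float("nan")

scores = {"c.55G>T": -3.2, "c.57A>G": 0.1}  # toy assay scores
clinvar_labels = {"c.55G>T": "pathogenic", "c.57A>G": "benign"}
print(concordance(scores, clinvar_labels))  # 1.0 for this toy example
```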
And within, I think, six months, we've had quite a bit of enthusiasm from bodies like ENIGMA that try to support common standards for variant adjudication, where they are now quickly adopting these results, and the variants are all in ClinVar, and that kind of thing. Okay, second part here. This is a completely different question, and this is work that I should say is part of the ENCODE Consortium, ENCODE 4 in particular, the new phase of ENCODE that focuses on the development and implementation of technologies for functional characterization of the non-coding genome. One of the pivotal challenges in this area: ENCODE has been enormously successful in characterizing candidate regulatory elements throughout the genome in a broad diversity of cell lines, but our belief that these actually are what we think they are is based on a historically small number of paradigmatic examples, and the vast majority we haven't actually functionally tested, to confirm that they are what we think they are and, moreover, to understand them more deeply. In particular, the question that I think is quite fascinating and important, particularly if we want to effectively follow up on genome-wide association studies, is: what genes do these elements actually regulate? There are correlative approaches for trying to get at this, but we're short on functional methods. Okay, so the approach that we've been developing is actually inspired by human genetics, and by the eQTL framework in particular. If you think about what we're doing with eQTLs, when we're looking at what enhancers control a particular gene, for example: you have many individuals, each individual is a combination of a different set of genetic variants throughout their genome, we also make expression measurements on this population, and then we perform a bunch of statistical tests. In every test we're saying: if we look at this particular variant, we look at the subset of individuals that have that variant, and we compare expression of nearby genes, looking for any correlation with which variant you have at that particular position. This is how we find eQTLs. And then we're leveraging the same population again and again and again, basically different combinations of individuals that have the other variants; we can use that to do additional tests. So we're really exploiting the random assortment of variation in the human population to get as much as we can out of this. Okay, so this works, and you can use this kind of framework, for example, to link a variant and an enhancer to regulation of a particular gene, but there are some important challenges. One is that the framework is limited in scope to standing, or common, variation: you need the variant to be common enough that you're powered to actually see an effect, and it also has to happen to be a variant that falls in an enhancer and disrupts its function in order for you to find something. The second limitation is the population history of humans: because of linkage disequilibrium, you might have a large number of variants on a haplotype block. You associate the haplotype block with a change in expression, but you don't know which variant is actually causal for the effect that you observe. Okay, so now, inspired by this framework, imagine a slightly different world, where rather than people, we have cells.
And rather than naturally occurring genetic variation as our perturbations, we have programmed CRISPR-mediated perturbations. But similar to QTL studies, instead of humans, every cell has a different combination of perturbations. And like QTL studies, we're going to be leveraging a single data set, where we've genotyped all the cells and we've expression-phenotyped all the cells, again and again: looking at different combinations of cells that happen to harbor a particular enhancer perturbation versus the subset that don't, and looking for changes in expression of genes that are located nearby. So it's extremely analogous. Instead of CRISPR with the double-stranded breaks I mentioned earlier, we're using CRISPR-i, where a KRAB domain is hooked up to a catalytically dead Cas9, so we're not actually introducing a cut; rather, we're epigenetically shutting down the enhancer that the guide targets this complex to. Okay, I should mention we're not the first to work in this general area. There have been quite a few papers from a number of groups, basically developing some of the key tools here and applying them, but typically in a bulk fashion: you've got a population of cells, let's say you put a GFP reporter on a gene, you introduce CRISPR-i perturbations to the vicinity of this gene, you sort the cells, maybe into high and low expression bins, and you compare which guides are present in each bin. So it's multiplexed in the sense that you're looking at many guides and many enhancers, but you've really built an assay for one gene, and really focused your efforts around that, and it obviously takes a lot of work to build a gene-specific assay. In contrast, what we're trying to do here is a generalizable framework where we can look at many candidate enhancers and potential effects on many transcripts in a genome-wide way, where we don't have to optimize for any particular target. And again, we're leveraging this one experiment. Okay, so I'll talk through two experiments: one was a pilot, one was more at scale. In the pilot experiment, we tried to target about 1,000 enhancers across the genome in K562 cells. We're putting in 15 guides per cell, and I'll talk in a bit about how we're doing that, and we're profiling 50,000 individuals, where the individuals are cells, and we're really analyzing this exactly like an eQTL study, literally. Okay, so which enhancers are we choosing? We're basing our decisions on data from the ENCODE project for K562 cells in this particular experiment, taking the kinds of marks that you might associate with enhancers in this cell line and targeting them specifically, because our question here really is trying to figure out what genes they regulate and learn something about that. And it's not one criterion; we actually have representation of a lot of different patterns of chromatin marks here. Okay, so how do we do this? We're introducing the guide RNAs via lentivirus at a high multiplicity of infection, so every cell is getting a random combination of, on average, 15 of these guides. Then we perform single-cell RNA-seq, in this case on the 10x Genomics platform, to measure expression.
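A back-of-envelope sketch of why loading many guides per cell buys power: each profiled cell contributes an observation to every guide it carries, so the effective per-guide sample size multiplies. The numbers below come from the talk; the 750,000 figure comes up again just below.

```python
def equivalent_single_guide_cells(n_cells, guides_per_cell):
    """Cells needed at one guide per cell to match the per-guide sample size."""
    return n_cells * guides_per_cell

# Pilot design: 15 guides per cell, 50,000 cells profiled.
print(equivalent_single_guide_cells(50_000, 15))  # 750,000 single-guide cells
```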
And one key point here is that we're able to take advantage of a vector developed by Christoph Bock's group, described by Datlinger et al. here, called CROP-seq, where the guide RNA actually gets expressed as a Pol II transcript as well, so you can pick it up in your modified 10x Genomics assay; there are some tweaks here that we describe in this paper down here. If you're in this space, one of the key advantages of this vector over similar methods like Perturb-seq is that it rather brilliantly avoids the phenomenon of lentiviral recombination that can otherwise scramble associations between barcodes and guides. It's a fairly nuanced technical point, but it's a fairly important one if you're actually trying to set up an experiment in this space. Okay, and the key point here is that multiplexing gets us a lot of power: by doing 15 guides per cell, we're able to get the same power from profiling 50,000 cells as we would get from 750,000 cells if we were only putting in one guide per cell. Okay, then, as I said, we're performing essentially cis-eQTL tests, where we're comparing the expression of genes in cells that have or don't have a particular guide. We're looking at all genes within a megabase of the enhancer that guide is targeting, which is typically what you do in these cis-eQTL studies. As a positive control, we can look at enhancers, I'm sorry, not enhancers, at promoters. If we target CRISPR-i to the promoters of particular genes and basically do the same thing, our strong expectation is that we should be knocking down the gene that that promoter belongs to. And in fact, 94% of the time we do get detectable knockdown of the transcript when we do these association tests on guides targeting the promoter itself. One thing that's striking, in trying to design this experiment and look for positive controls, is how few hands-down, for-sure enhancers there are, where you definitely know the target gene, even in a cell line like K562 that has been studied to death. So here our positive control is the beta-globin LCR, which of course is one of the paradigmatic examples, and in that case, when we target the enhancer, we do in fact see knockdown of the appropriate globin transcripts, as expected. Okay, so now what about the actual experimental guides? Here, these are the ones where we're targeting enhancers, in orange, and then we have some non-targeting controls in gray. And what I'm plotting here is a QQ plot, often used for GWAS or eQTL studies, which shows the distribution of expected versus observed p-values. As you can see, we get good agreement with the expected distribution for the non-targeting controls, but we have this uptick, an excess of significant p-values, when we're targeting enhancers. So in total, targeting about 1,000 of these candidate enhancers, we get about 145 enhancer-gene associations that come up as significant under an empirical FDR of 3.5%; we call these CRISPR-QTLs. And just for clarity, we are seeing all kinds of things: it's not one enhancer, one gene all the time. Rather, we have some enhancers associated with multiple genes, and some genes targeted by multiple enhancers that we tested, but 145 of these unique pairings. Okay, 145 out of around 1,000 tested, so about 15%.
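The test at the heart of this analysis can be sketched roughly as follows, assuming a per-cell expression vector for one gene and a boolean guide assignment per cell. The published analysis involves normalization and covariates omitted here, and the rank-based test below is a generic stand-in rather than the exact model used.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def guide_gene_test(expr, has_guide):
    """Compare one gene's expression in cells with vs. without a given guide."""
    with_g, without_g = expr[has_guide], expr[~has_guide]
    _, p = mannwhitneyu(with_g, without_g, alternative="two-sided")
    knockdown = 1 - with_g.mean() / without_g.mean()
    return p, knockdown

rng = np.random.default_rng(0)
expr = rng.poisson(5, size=50_000).astype(float)        # toy UMI counts, one gene
has_guide = rng.random(50_000) < 0.01                   # ~500 cells carry the guide
expr[has_guide] = rng.poisson(2, size=int(has_guide.sum()))  # simulated knockdown
print(guide_gene_test(expr, has_guide))                 # tiny p-value, ~60% knockdown
```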
So just bear that number in mind, for consistency with the next experiment that I'll tell you about. This is just one example. This is the gene NMU, neuromedin U, and these plots are each just showing no gRNA versus yes gRNA, comparing the distribution of expression values across large numbers of cells. Here we see no difference for a non-targeting control. If we target the promoter and look at that gene, we see a big difference: 77% knockdown. We look at this candidate enhancer here: it doesn't appear to be significantly associated, and it has kind of weak biochemical marks. But if we look out here, where we actually had three different elements that we were targeting that happened to be clustered together, we get consistently strong knockdown of the NMU transcript. So here, over a distance of roughly 50 kilobases, we can make a reasonably strong assertion that we have a cluster of enhancers that regulate NMU expression. Okay, so then we tried to scale this up and target about 5,000 enhancers, with many more cells and a higher number of guides per cell, for what we hoped would be a more sensitive experiment. We tried to do this, and I'm not going to go through all the details here, in a slightly more intelligent way, based on what we learned from the first experiment about what might be a hit and what might not. At many levels it was just a bigger experiment, profiling roughly 200,000 cells, where, given the MOI we had, we would have had to profile 6.5 million cells to achieve the same power if we hadn't been multiplexing. With basically the exact same framework, at an empirical FDR of 10%, we ended up with about 660 of these CRISPR-QTLs, again out of about 5,000 tested. So again, roughly north of 10% but less than 20% come up as CRISPR-QTLs. Okay, so what determines, of our 5,000 targets, which ones actually come up here? One thing we can do is look at the strength of various ENCODE marks and ask whether they're predictive of whether something will end up as a CRISPR-QTL. And in fact, they are. Here, looking for example at H3K27 acetylation, the strongest quintile of peak strength does do a better job of predicting success in this assay. Other marks: p300 does the same, H3K4 monomethylation less so. And you can imagine using these kinds of data, which I would say are not really a gold standard until we're going in and deleting each element, but I think it's closer to a gold standard than we are right now, to train better models of how to predict enhancer-gene relationships and to understand our false positive and false negative rates a little better. Other kinds of data also line up with this. Here, for example, looking at Hi-C data, which has been deeply sampled in this particular cell line by Rao et al., we can see that our CRISPR-QTLs do show a strong enrichment for loops over a distance. That being said, it's also important to recognize that there are a lot of loops that did not end up as functional interactions in our data, and a good number of CRISPR-QTLs that are not predicted by loops. So these associations give us some more confidence, but it's also telling us something about the fact that we're looking at the elephant a lot of different ways and not always seeing exactly the same thing.
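One simple way to arrive at empirical FDRs like the 3.5% and 10% figures quoted is to calibrate against the non-targeting controls; a sketch, assuming lists of p-values from targeting and control tests (the exact procedure in the paper may differ).

```python
def empirical_fdr(p_targeting, p_control, threshold):
    """Estimate FDR at a threshold, using non-targeting controls as the null."""
    hit_rate = sum(p <= threshold for p in p_targeting) / len(p_targeting)
    null_rate = sum(p <= threshold for p in p_control) / len(p_control)
    return null_rate / hit_rate if hit_rate > 0 else float("nan")

# Toy p-values: targeting tests show an excess of small values over controls,
# mirroring the uptick in the QQ plot described above.
p_targeting = [1e-8, 0.002, 0.03, 0.2, 0.6, 0.9]
p_control = [0.05, 0.3, 0.5, 0.7, 0.8, 0.95]
print(empirical_fdr(p_targeting, p_control, threshold=0.04))
```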
Okay, so for me the highlight of this whole experiment was this question, which I think has never really been empirically asked. We talk about how far enhancers are from the genes they regulate, and people often cite the most extreme examples: a megabase, two megabases. There are certainly real examples like this, but what does the distribution look like? That's an important question. And I'm not saying this experiment is perfect, it certainly isn't, but I think it is a systematic effort where we know exactly how we designed it. So this is the distribution of all the tests that we did, in terms of distance from the element that we tested to the promoter that we were testing against; the median distance here is 440 kilobases. And this is the distribution of the CRISPR-QTL hits. We definitely have some out here that we believe are real, including that 50 kb example I showed, but the vast majority of successful associations are pretty close to the gene they're targeting. So, a frequent thing to do in these kinds of analyses, historically, if you're trying to figure out what gene an enhancer regulates, is to take the closest gene. Does that actually work? Here we're looking at the closest expressed gene to the hit, and we do see that, ballpark, half the time, maybe even a bit more than that, the closest gene is the one that is regulated. But there are a good number of examples, let's say a third, where that's not the case. And another concern, as soon as you see this: well, maybe it's all right up against the promoter. That's not actually the case. If we zoom into this 25 kb window, we see the hits are actually fairly uniformly distributed across that window, with only a slight enrichment in the final five kilobases, and we're not considering anything within a kilobase of the actual TSS. And again, this is the kind of information that can be used to build empirically predictive models of these sorts of links.
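A sketch of the two summaries just described, the distance distribution and the closest-expressed-gene check, assuming each significant pair is stored as (enhancer midpoint, target TSS) together with the TSSs of all expressed genes nearby; all coordinates below are toys.

```python
import statistics

def distances_and_closest(pairs, expressed_tss):
    """Median enhancer-target distance, and how often the target is the closest gene."""
    dists, closest_is_target = [], 0
    for enh_mid, target_tss in pairs:
        dists.append(abs(enh_mid - target_tss))
        nearest = min(expressed_tss, key=lambda tss: abs(tss - enh_mid))
        closest_is_target += (nearest == target_tss)
    return statistics.median(dists), closest_is_target / len(pairs)

pairs = [(1_000_000, 1_024_000), (2_000_000, 2_310_000)]
expressed_tss = [1_024_000, 1_900_000, 2_050_000, 2_310_000]
print(distances_and_closest(pairs, expressed_tss))  # (167000, 0.5)
```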
Okay, so that's that. Third part here, again a completely different topic: how do we interpret signal from genome-wide association studies? This is work from Darren Cusanovich, who was a postdoctoral fellow in the lab and is now a professor at the University of Arizona, and Andrew Hill, who until three days ago was a grad student in my lab; they worked together on this. Okay, so as you all know, with GWAS, depending on whether your glass is half full or half empty, there are now over 100,000 reproducible associations to common diseases. We've learned things like the fact that the signal largely maps to non-coding regions, and we've also learned things about the architecture of disease in other ways: for example, that it's largely small effect sizes in these non-coding regions that collectively explain a substantial fraction of heritability. But moving to the next stage, actually trying to nail the genes, has been challenging, in part because of some of the problems I highlighted earlier, like linkage disequilibrium and not knowing which genes a particular non-coding regulatory variant actually regulates. Okay, over the last decade, and I think people are probably familiar with this, we've seen an explosion of single-cell technologies, and I'm going to come back to the GWAS; I made a bit of an abrupt transition there. An explosion of technologies on an exponential scale, where every year you're seeing almost an order of magnitude increase in the kinds of studies that one can do with these technologies. So I'm going to talk about some of our single-cell work, and then I'm going to come back to how this relates to GWAS towards the end. Our particular flavor, which we developed together with Cole Trapnell's group and collaborators at Illumina, in particular Frank Steemers' group, is this idea of combinatorial indexing. The question we started with is: much as next-gen sequencing lets you sequence without isolating each molecule, can we profile the molecular contents of large numbers of cells without ever actually isolating single cells? So we start with some large number of cells. These can be cells or nuclei; they can be fresh, they can be fixed, they can be frozen. We split them out to some number of wells on a plate, and then we do the following. The procedure I'm going to describe is a combinatorial indexing strategy where our ideal endpoint is that we've labeled the nucleic acid contents of each cell in a way that's unique, different from every other cell in the population. The way we do this is to leave the cells intact, distribute them to these wells, and then start performing in situ molecular reactions inside the cell or nucleus. In the particular example I'll show here, let's say we're using reverse transcriptase to make first-strand cDNA, but we're appending a barcode, and that barcoding happens inside the cells, which we've permeabilized. We then pool the cells from all these wells back together, and we split them back out to a new set of wells. Then we perform some other form of molecular indexing, using fairly standard biochemistry, again inside the cell. And then we're done: we can recover all of the nucleic acids, which have been labeled with these combinations of indexes, lyse the cells, pool everything, and sequence those molecules. The key point is that every nucleic acid we're interested in is now tagged by a set of barcodes that, in principle, is shared by molecules from the same cell and identifies which cell they came from. This is subject to the birthday problem: just as two of us might have the same birthday, two cells might traverse the same combination of wells. So it's not perfect, but just as I can predict how many people here share a birthday, you can predict how many of these, quote, collisions we get. This overall framework was developed four years ago now by Darren and Riza Daza, a research scientist in my group, and we call it single-cell combinatorial indexing, or 'sci'. We and collaborators have developed a lot of different flavors of this basic idea over the last few years. We started with sci-ATAC-seq for measuring chromatin accessibility, then sci-Hi-C for nuclear architecture; whole-genome DNA sequencing was developed by Andrew Adey at OHSU, a former student of mine; transcriptome profiling, which I'll mention briefly on the next slide; methylation profiling, also by Andrew; and then co-assays of chromatin accessibility and expression in the same cells. And recently we also have a paper on nascent transcription.
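The collision prediction mentioned above is just the birthday problem; a quick sketch, assuming two rounds of split-pool indexing across 96-well plates (the plate sizes are illustrative).

```python
def collision_fraction(n_cells, wells_round1, wells_round2):
    """Expected fraction of cells sharing their barcode combination with another cell."""
    combos = wells_round1 * wells_round2
    p_unique = (1 - 1 / combos) ** (n_cells - 1)  # no other cell drew the same combo
    return 1 - p_unique

print(collision_fraction(1_000, 96, 96))   # ~0.10 with two 96-well rounds
print(collision_fraction(10_000, 96, 96))  # ~0.66: far too many cells for this depth
# A third round of indexing multiplies the barcode space (96**3 = 884,736 combos),
# which is what makes million-cell experiments feasible.
```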
So really, any kind of standard biochemistry, if you get a little creative, can be adapted to this basic framework of combinatorial indexing to get, effectively, a single-cell version of that assay. Okay, I'm mostly going to talk about single-cell ATAC-seq, but I want to briefly mention some other things we've done. So Junyue Cao in the lab developed the sci-RNA-seq protocol, with a first round of indexing via reverse transcription and a second via PCR, and we first used it to make an atlas of transcription of the worm, the L2 worm, where, by getting 40,000 transcriptomes, we're over-covering the cellular content of any single worm. Instead of 50x coverage of the genome, it's 50x coverage of the cells, you could say. And Bob Waterston, Cole, and others have recently done some really fascinating work, which is on bioRxiv if you have a chance to look at it, connecting this back to Sulston's lineage in a really remarkable way. More recently, Junyue developed a three-round version of combinatorial indexing with which he, by himself in the lab over about a week and a half, was able to generate what is now, I think, one of the largest publicly available single-cell RNA-seq data sets. This was a week-and-a-half experiment and one NovaSeq run, in which he profiled two million cells, and we applied it to profile mouse embryo organogenesis from E9.5 to E13.5, Junyue together with a postdoc, Malte Spielmann. All the rest of this talk, I should say, is collaborative work with Cole Trapnell's group, which is also in my department. And I refer you to the paper, because I don't have time to talk about this, but what you can see around the development of all the various lineages of the organism is remarkable; we've really only scratched the surface of what we could potentially look at with these data. Okay, so what I am going to talk about, and I'll try to get through this in the last four minutes, is single-cell ATAC-seq, where we're doing basically Tn5 transposition with an indexed barcode, pooling and splitting, and then an indexed PCR to get the second index on there. I'm going to skip that. Darren and Andrew applied this last year to the mouse, in particular to 13 tissues of the adult mouse: effectively, an atlas of mammalian chromatin accessibility that was as comprehensive as we could get it, and we ended up profiling around 100,000 cells. We get clusters; this is a t-SNE. We can look at, for example, hepatocytes, which are boring, while astrocytes are more interesting, with distinct subsets even at the level of chromatin accessibility. I don't want to offend any liver people, but that's at least what we see at the resolution of our data. And so, analogous to ENCODE, I think there is a lot of value in this kind of regulatory biology over and above gene expression data, just as the gene regulation parts of ENCODE were in many ways, at least from my perspective, the most exciting, relative to just looking at expression. And this is just an example of the kinds of things you can see when you extend that to the single-cell level: here, astrocyte-subtype-specific regulatory accessibility, presumably enhancers. And endothelial cells: subtype-specific enhancers. Okay, so what does any of this have to do with genome-wide association studies?
Okay, so a couple of years ago, a number of groups, including John Stamatoyannopoulos's group, showed that you do see enrichment of GWAS heritability signal in the accessible chromatin of relevant cell types, I'm sorry, of relevant tissues. This was through ENCODE, so they're often looking at cell lines or tissues that are relevant to the disease in question: if you look at an inflammatory disease, you might see that the heritability signal partitions to, let's say, PBMCs, or the open chromatin of PBMCs. Okay, so can we extend this kind of thinking to single-cell-resolution chromatin accessibility data sets? The practical challenge here is that we have human GWAS data, not mouse GWAS data, and we have a mouse atlas of chromatin accessibility but, at least not yet, no human atlas of chromatin accessibility. So how do you bridge these things? Well, mouse and human are related, and you can just lift the coordinates over, and it still works amazingly well. So we can lift over the accessibility coordinates from mouse to human and look at enrichments of GWAS heritability in those data. Okay, so even if we forget that we have single-cell data, can we at least reproduce what John and others did, just treating these as individual tissues? And in fact, this kind of works: if we look at neurological disorders, we see enrichment for the prefrontal cortex tissue (this is our data, just forgetting that we have single-cell data and collapsing it), and so on and so forth; you get it. But now, if we look at the single-cell data, our single-cell clusters, we have 85 cell types against the GWAS traits here, which are the columns; it's much more granular. And I'll give you an example of what I mean by this. If we look at bipolar disorder and the maximal enrichments across cell types, the top enrichments are all for cortical neurons, and excitatory neurons in particular, which is kind of what you'd expect based on the biology. In contrast, if you do the same thing for Alzheimer's disease, not a single neuron pops up; rather, the top association is with microglia, which, as we have only recently come to understand, is probably the cell type of principal relevance for Alzheimer's disease. Okay, so now, with the availability of UK Biobank, we can extend the same thing to all 2,500 traits that are available there, still with our 85 mouse cell types. And this is a lot of fun to play with; we have it all online, if you're interested in what cell type is responsible for your left arm subcutaneous fat. But here are just a few examples. Looking at gout, the top enrichment is not just for kidney, but for the kidney proximal tubule, and for a particular subtype of cells in the proximal tubule. Looking at things like emphysema and systolic blood pressure, they both pin to endothelium, but to different beds of vasculature: lung venous endothelium in the case of emphysema and bronchitis, and heart arterial endothelium in the case of systolic blood pressure. And the last one I'll end with: the pain when you walk is in your mind; inhibitory neurons for that one.
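The published analyses use LD-score-regression-style heritability partitioning; as a crude stand-in just to convey the shape of the computation, here is a contingency-table sketch, assuming counts of trait-associated versus background SNPs overlapping a given cell type's lifted-over peaks. All numbers are invented.

```python
from scipy.stats import fisher_exact

def peak_enrichment(trait_in_peaks, trait_total, bg_in_peaks, bg_total):
    """Odds ratio and p-value for trait SNPs overlapping one cell type's peaks."""
    table = [[trait_in_peaks, trait_total - trait_in_peaks],
             [bg_in_peaks, bg_total - bg_in_peaks]]
    return fisher_exact(table, alternative="greater")

# Toy counts: trait-associated SNPs land in microglia peaks far more often
# than background SNPs do.
print(peak_enrichment(40, 200, 500, 20_000))  # odds ratio ~9.75, small p-value
```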
Okay, so this is giving you a glimpse of how I think these multiplex methods have a lot of runway in terms of how we can apply them to contemporary and important questions in genomics and biology in general. And I'll end there, and just thank the people in my lab who did the work, and thank you all for listening.

Well, Jay, thank you so much for that really terrific presentation. We have just a small token of our appreciation of your talk here, something you can put on your wall if you want. Eric, do you want to come on up and maybe we can pose? And then I'll take questions. Yes, take questions. All right, two microphones. You had time to think about your questions during the photo session. So maybe to start out: you showed in one of your final slides gout as one of the diseases you were interested in, and you had the kidney tubule highlighted, but nothing about inflammation, an area of great interest to me. And in fact, one can target IL-1, for example, for the episodes of gout. Did you find anything?

So I should qualify how I described the result, because we did these analyses in a way where we're looking for the top enrichments. One thing that's lacking from this field of heritability partitioning is something analogous to the way you look for conditional contributions of SNPs to a GWAS association. What we should do is say: okay, proximal tubule; conditional on that, what's left? And we haven't done that yet; there aren't good methods for that yet. I think that's something we're thinking about.

Okay, well, while we have that interchange, people can line up at the microphones. Okay. This was a really, really interesting talk. Thanks, Jay. I find the CRISPR-QTL data particularly exciting. So, of the ones that you found, your CRISPR-QTL hits, the ones that didn't have concordance with the Hi-C data, what do you think is happening for those?

So there are about nine possible explanations, and they are not mutually exclusive. Just to rattle them off partially: ineffective CRISPR-i, that's certainly a possibility. The experiment is not sufficiently sensitive to detect an actual effect. They are bona fide enhancers, but we're somehow missing the right context; maybe the cells need a stimulation of some sort that might provoke it. Maybe they were enhancers, but we're not going through the whole developmental history of the cell; this is just looking at one static time point, so that lack of a dynamic framework is somehow causing us to miss it. What am I missing? I'm missing at least a few here. Shadow enhancers, meaning that they might be buffered by other elements, and we haven't been looking at combinations. Or they're not enhancers; maybe they are producing some non-coding RNAs that act in some other way, it could be something like that. Or, my own weak, completely intuitive, not-data-supported suspicion: they are biologically relevant, but we're lumping everything with those marks together as enhancers when there's more heterogeneity in the actual functionality than we're appreciating. The other eight explanations are also on the table, and we're now trying to go through them systematically.

Jay, that was beautiful, as always. As we expand these techniques to labs across the world, how much do you think we can realistically bring down the cost, to make these more accessible to labs on a university funding scheme? Which technology were you talking about?
I think all the multiplex technologies. There are obviously going to be different pros and cons for each, but not everyone has the kind of funding to do the experiment they want to do. So how can we make these technologies more affordable, so people can design better experiments?

So I would actually say that the vast majority of what I described is not that expensive, with the exception of the single-cell profiling on the 10x platform. Doing those 200,000 cells set us back a fair bit. Our two-million-cell experiment, which we've now done more than once, set us back substantially less than that 200,000-cell 10x experiment. So there are ways of doing it that would make everything I talked about cheaper. But I do think the sci paradigm in particular is the path: if we want sublinear scaling of costs, you can't beat exponential, you need an exponentially scalable technology. Absolutely. Maybe it'll take five rounds, but you'll get there. Thank you.

So I noticed in your intergenic study, or your enhancer study, you were looking sort of prior to the promoter. Did you happen to look at any intragenic enhancers, and if so, what did you find?

It's a great question. You mean gene-body enhancers, right? We deliberately excluded them from the design because we were worried, as other studies have been, that sitting a giant KRAB-dCas9 CRISPR-i complex down in the middle of a transcriptional unit would screw up expression in a way that wasn't actually connected to enhancer activity. So it remains a bit of a mystery. I mean, certainly there are intronic enhancers, tons of them. How to study them? If we could change this to a deletion-based framework rather than CRISPR-i, I think I'd be more comfortable with that. Thank you. But it's important.

Hi, Jay. In your CRISPR-i data, when you look at the distance effect, did you try to fit a function to see the dependency on distance, how that varies across different loci in the genome, and whether it's universal across different cell types and different genes and pathways?

So everything you saw is everything we have, meaning that what I showed is a collated distribution across all genes; we don't have the power to make locus-specific distinctions. I do think we could probably look at it more and try to see more patterns than we have, but I think we would need to do a lot more cell lines and a lot more tiling before we're able to make those kinds of generalizations. But that is kind of the goal: to have a function that takes into account distance. Sequence composition? Well, sequence composition and biochemical marks, Hi-C data, promoter-enhancer pairing in terms of motifs, and tries to make a best guess. But that decay you saw, is it like a 1-over-x kind of thing? We did not fit it, but we probably should, and we should think about how that relates to genome conformation, which is a great question. But we haven't. Okay, thanks. Cool.

So in the context of the CRISPR-QTLs, I was wondering: are you thinking of looking for interactions, pairs of enhancers that behave differently from either one alone? Yeah, so the next phase of this will be to ask whether we can use the same framework to look at one of those several alternative hypotheses for what's explaining the other 80% or 90%, which is looking at combinations.
And you can easily imagine the same framework being used for that. So that's underway, but no answer yet. Looks like we've exhausted the questions, at least for now. So, Jay, thank you again. Thank you. Thank you very much.