So, it's an honor to be introducing our guest today, Dr. Doug Fowler of the University of Washington in the Department of Genome Sciences. Doug did his PhD at the Scripps Research Institute in San Diego. We were just hearing about your exploits in stealing, or maybe I'm not supposed to be saying that in the microphone. After his PhD, Doug went on to do a postdoc at the University of Washington in Stan Fields' lab, where he developed the technique known as deep mutational scanning. Deep mutational scanning is basically an approach to make every possible amino acid substitution in a protein of interest and then study the effects of those substitutions on the protein's function. And this has clear applications in so many aspects of biology: for instance, to understand the potential disease consequences of variants in disease genes, to understand evolutionary trajectories of proteins such as, say, the coronavirus spike protein, or even just to understand general questions of protein function and protein biology. So, in 2012, the Genome Sciences department saw the potential of Doug's research program and hired him on as faculty. I would say that about a decade later, that potential has been absolutely realized: deep mutational scanning is used throughout research and has really changed the way that we think about genetic experimentation. On a personal level, I'd say it's really had a big influence on me and the research that my lab does. So, Doug, we're really excited that you're here and excited about your talk. Oh, thank you so much for that kind introduction. Well, you did a great job telling people about what my lab has focused on, at least in part, over the last 10 years. It's really exciting to be here, and thanks to all of you in the room and on Zoom for your attention.
I was just telling Eric that the last time I was here was actually interviewing for a Stadtman fellow position ages ago. I didn't get an offer, but I did get a chance to practice a skill that everyone told me I should practice for interviews, which is the elevator pitch. It's sort of like learning algebra when you're 12: you think you're never going to have to use it. But sure enough, as I was reminding Eric, I ended up in an elevator with him at some point and hurriedly gave my elevator pitch. So I got to use it, but that was the last time I was in this building, I think. So it's great to be here. My lab works at the intersection of technology development, particularly in genomics, computational approaches, and protein science. It's been my great fortune in the last 10 years to go from a very basic research program to one that is increasingly translational, and also focused on trying to implement some of the translational work that we've done, or at least be partners in implementation. That's been a really fun journey for me, and I'm going to try to share it with you, as well as tell you about a little bit of the technology development work that we're doing to further our overall goal, which is trying to understand genetic variation at a scale that's compatible with the size of the problem. And the size of the problem is huge. This is not news to those of you here at NHGRI, but we know, just based on the human mutation rate, that if we were to sequence everybody on the planet, we would observe every single nucleotide variant on the order of 50 times or so. And of course, there are lots of other variants that are bigger and more complex, and we'd find lots of them too. So we will eventually encounter nearly every possible single nucleotide variant, and we'll encounter each one a few times, but not enough times to do all the association-based genetics that we'd like in order to figure out what those variants do.
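The back-of-envelope logic behind that "on the order of 50 times" figure can be sketched in a few lines. The population size and per-site mutation rate below are standard ballpark figures I'm assuming, not numbers from the talk:

```python
# Rough sanity check of the "every SNV seen ~50 times" claim.
# All inputs are assumed ballpark figures, not data from the talk.
population = 8e9          # people on the planet (assumption)
mu_per_site = 1.2e-8      # de novo mutations per site per generation (assumption)
alleles_per_person = 2    # diploid genome
changes_per_site = 3      # each base can mutate to 3 alternative bases

# Expected number of independent de novo occurrences of one specific
# single nucleotide variant across the whole population.
expected_occurrences = population * alleles_per_person * mu_per_site / changes_per_site
print(f"~{expected_occurrences:.0f} occurrences")
```

With these assumed inputs the estimate lands in the tens of occurrences per variant, the same order of magnitude as the roughly 50 quoted above.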
And this problem of understanding variants at scale has really shown up already in the clinic, if we look in a database like ClinVar, which is what this data is from. These are variants that have occurred in a clinically relevant part of the genome, often as a consequence of a clinical genetic test, where a clinician or medical geneticist has looked at the variant and all the information they have about the patient, their phenotype, maybe a pedigree if that's available, the variant's frequency in the population, and so on and so forth, and tried to make a determination of whether the variant is pathogenic or benign. In fact, for about 75% of variants, the determination is actually something called a variant of uncertain significance. That's where there's not enough information to say whether the variant is pathogenic or benign. And this represents sort of a dead end, and I would argue a key limiter for genetics in medicine, right, because if you can't figure out whether a variant is pathogenic or benign, it's difficult, if not impossible, to take action in terms of treatment or to provide an accurate diagnosis. And the last thing I'll say here is just that this variant of uncertain significance problem is growing rapidly; it's the gray line on the left there. The reason is that as we sequence more and more people, we find more and more rare variants. We don't have information about these rare variants, so they get interpreted as variants of uncertain significance. And so that's one of the key problems that my lab has grappled with.
And so one type of information that you can in principle generate for every variant, or any variant, is experimental assay results. Right, so you can take a variant, clone it, express the protein, and study the variant protein if the variant is in a protein-coding region of the genome, or you can have a variant cell line or animal model, compare the variant to a reference sequence, and make some conclusion about whether the variant causes a loss of function or a gain of function, or whether it looks like the reference. That's obviously been done for a long, long time and it works really, really well, but the problem this process faces is a scaling problem: how do we go from being able to test a handful of variants in the lab in one experiment to the scale of the millions or billions of variants that, I argue, we will eventually want to evaluate experimentally? And as Mary so kindly pointed out, that's a problem that I've worked on and contributed solutions to, along with many others, in the form of multiplexed assays of variant effect, of which deep mutational scans are one. The idea there is that variants can be assessed together in a pooled format, using high-throughput sequencing as a readout. The way that works, if you're not familiar with this type of assay, is that you start with a pool of variants, a library of variants, and in this talk we'll be talking about using cultured human cells, where each of these little circles represents a cell expressing just one variant. Then you subject that pool to some sort of selection, and maybe the easiest selection to imagine and think about is a growth assay, where you've rigged up the cells such that their growth depends on the functional capacity of the variant that they express.
So in this example, the blue cells express a non-functional variant and they drop out of the population, and then you use high-throughput DNA sequencing to count each variant's frequency before and after selection. From those counts we can make a variant effect map like the one that's shown here, where in this little snippet of a map in a protein-coding region of the genome every column is a position and every row is an amino acid substitution; a blue tile means that the variant is a loss-of-function variant, white means that the variant has reference-sequence-like function, and red means that the variant is a gain-of-function variant. So that's a quick primer on multiplexed assays of variant effect, and I would say that my lab, along with a community of other labs, has rolled out a first generation of key enabling technologies in this space with NHGRI support. We now have a set of assays that can measure the effects of tens of thousands or hundreds of thousands of variants at one time for fairly simple phenotypes like cell growth, protein-ligand binding, protein abundance, fluorescent reporters, some single-cell assays, and also things like reporter transactivation in an MPRA style. So we have this first generation of technologies, and they've really begun to be widely adopted and deployed. This is just a graph that I put together for this talk, and what you can see is that at this point about 11 million single nucleotide variants have been assayed in one of these multiplexed assays of variant effect.
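The core computation behind this kind of selection assay is simple: compare each variant's sequencing-read frequency before and after selection. Here is a minimal sketch of that idea, not the actual analysis pipeline used for these maps; the pseudocount and function name are my own choices:

```python
import math

def enrichment_score(pre_count, post_count, pre_total, post_total, pseudocount=0.5):
    """Log2 change in a variant's read frequency across selection.
    The pseudocount keeps variants that drop out entirely from
    producing a division by zero or log of zero."""
    freq_pre = (pre_count + pseudocount) / pre_total
    freq_post = (post_count + pseudocount) / post_total
    return math.log2(freq_post / freq_pre)

# Toy read counts out of one million reads per timepoint:
print(enrichment_score(1000, 50, 1_000_000, 1_000_000))   # strongly negative: depleted
print(enrichment_score(1000, 1000, 1_000_000, 1_000_000)) # ~0: reference-like
```

In a real analysis you would also normalize against synonymous (reference-like) variants and estimate score uncertainty from replicates, but the log-ratio of frequencies is the heart of it.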
And what I think that means is that we're at a really interesting point in this journey of trying to understand variant effects in a comprehensive way experimentally, because we now have enough data for enough genes to really understand what these data are useful for, to be able to compare different types of assays, and to be able to chart a course from where we are now to where I would argue we want to be, which is having a functional understanding for every variant at every base, at least of the disease-related genome. So let's talk about what these data can do. I think the first question we're interested in answering is: do these variant functional data deliver on the promise of helping to understand variants in the clinic? That's the first thing I'm going to tell you about, and then after that I'll tell you a little bit about what new technologies we view as important to develop, and I'll go through one of those that we've worked on in my lab. So, along with a really close collaborator and close friend, Lea Starita, we executed a project where we curated functional data from multiplexed assays for about 20,000 variants in three key cancer risk genes. These are genes where if you carry a pathogenic germline variant, you're at a greatly increased risk of developing one or more types of cancer. One of those data sets is shown here, for BRCA1. What you're looking at is just a histogram of the functional scores of control variants, so known pathogenic variants or known benign variants that were in this assay. And you can see that the functional assay cleanly separates the known pathogenic from the known benign variants. That's one of the assays that we looked at, and there are about 4,000 variant effects that were measured in that assay.
The other two genes are TP53 and PTEN, and we'll talk about their complexities a bit as we go through this part of the talk. We had four data sets from two different labs for TP53, probing two different mechanisms: a loss-of-function mechanism and a dominant negative mechanism. And then for PTEN, we had two different data sets from two different labs, one focusing on variant abundance at the protein level, and the other on the phosphatase activity of PTEN variants. So this represented a real-world cross section of data sets that different labs had generated. We wanted to know whether we could use these data to reinterpret variants of uncertain significance in these three genes. To do that we partnered with Ambry Genetics, a large clinical diagnostics company, and went through the following process. We did a little bit of curation with the data sets that I just told you about, and then we went through an exercise that is required in order to use evidence in interpreting a genetic variant. That process has been laid out by ClinGen, and basically it's the set of rules that you have to follow to arrive at a data-driven conclusion about whether a variant is pathogenic or benign, or whether you don't have enough evidence to tell. And the key thing that you have to do to use a piece of functional evidence is to determine how much you can rely on the information that that functional evidence is giving you about the variant being pathogenic or benign. That breaks out into categories: the evidence can be either strong, moderate, or supporting, meaning it can push your variant in the pathogenic direction or the benign direction a lot, a moderate amount, or a little bit.
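To make that concrete, here is a sketch of how an assay's evidence strength can be derived from control variants, in the spirit of the ClinGen-style calibration mentioned here. The threshold values and the simple two-proportion odds calculation are assumptions quoted from memory, so treat this as illustrative rather than as the exact rules applied in the study:

```python
def odds_path(prior_p, posterior_p):
    """OddsPath-style likelihood ratio: compares the proportion of pathogenic
    variants among all controls (prior_p) to the proportion among controls
    that score abnormal in the assay (posterior_p)."""
    return (posterior_p * (1 - prior_p)) / ((1 - posterior_p) * prior_p)

def evidence_strength(lr):
    """Map a likelihood ratio onto evidence strengths. Thresholds follow the
    published calibration as I recall it (assumed values; check the primary
    framework before any clinical use)."""
    if lr >= 18.7:
        return "strong"
    if lr >= 4.3:
        return "moderate"
    if lr >= 2.1:
        return "supporting"
    return "insufficient"

# Example: half the controls are pathogenic a priori, and 95% of the
# controls scoring "abnormal" in the assay are pathogenic.
lr = odds_path(0.5, 0.95)
print(round(lr, 1), evidence_strength(lr))  # 19.0 strong
```

An assay that separates controls perfectly drives the posterior proportion toward 1, so its likelihood ratio, and hence its evidence strength, climbs accordingly.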
And the way that evidence gets assigned to these different categories is by evaluating how well the functional assay separates the benign control variants from the pathogenic control variants. So if an assay separates those variants perfectly, then it will get strong evidence; if it separates them less perfectly, it might get moderate evidence, and so forth. We went through that process for each assay, and then, once we had those evidence strengths, we could apply the evidence to the variants of uncertain significance for each of these genes. So that's what we did. Okay. So here's the BRCA1 data again. As I said, this is an example of a functional assay that basically perfectly separates a whole bunch of control variants, and as such it gets strong evidence. So if you have a variant that scores in the pathogenic region of the functional score range, then that would be strong evidence of that variant being pathogenic, and vice versa if a variant scores in the benign range of the assay. So, with this exercise completed, we could go to the set of variants of uncertain significance that were in the Ambry database, and we could just ask: okay, now, adding the functional evidence to all the other types of information you have about each of these variants, how many of them can we reclassify? And the answer is that we could reclassify about 50% of the BRCA1 variants of uncertain significance. And this was a really exciting result for us, right, because it said that, okay:
We generated this data, which took a lot of effort; we claimed that it was useful, and now we've shown quantitatively that it could resolve half of the variants of uncertain significance in this gene. And I think it's important to just pause and reflect that someone carries each one of these variants, right, and the result that they got previously from their genetic test was, well, this result is inconclusive, we can't say anything about this variant. In this case we returned these results to the physicians, and so now at least the people whose variants were reclassified get a more definitive classification, which is either, you don't have to worry, or, maybe you want to take some action on the basis of the variant that you carry. So for TP53, the case was a little bit more complicated. There, as I said, there were four different data sets, and you'll notice that the blue and red are mixed in these data sets: none of the assays on its own separates the pathogenic and benign variants perfectly. So we used a naive Bayes classifier to combine these four data sets, evaluating the performance of that classifier on some test variants that we had held out. And what we determined was that our classifier based on these data gave strong evidence of pathogenicity and moderate evidence of a variant being benign. We could then repeat the same reclassification exercise for TP53, and we found there that we could reclassify 70% of the variants of uncertain significance using this classifier. And I'll just remind you that although the number of variants in the Ambry database here is fairly small, we have functional evidence for basically all the variants that are already in ClinVar, as well as pretty much all the variants that will ever be seen in TP53.
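As an illustration of the classifier idea, here is a from-scratch Gaussian naive Bayes posterior over a set of assay scores. The class-conditional parameters below are toy values I've made up, not the actual TP53 fits; in the real analysis the parameters would be estimated from held-out control variants:

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def naive_bayes_posterior(scores, path_params, benign_params, prior_path=0.5):
    """Posterior probability that a variant is pathogenic given scores from
    several assays, naively assuming the assays are independent and the
    class-conditional score distributions are Gaussian."""
    lik_path, lik_benign = 1.0, 1.0
    for x, (mp, sp), (mb, sb) in zip(scores, path_params, benign_params):
        lik_path *= gaussian_pdf(x, mp, sp)
        lik_benign *= gaussian_pdf(x, mb, sb)
    num = lik_path * prior_path
    return num / (num + lik_benign * (1 - prior_path))

# Toy parameters (not the real TP53 fits): four assays where pathogenic
# controls score near 0 and benign controls score near 1.
path_params = [(0.0, 0.2)] * 4
benign_params = [(1.0, 0.2)] * 4
print(naive_bayes_posterior([0.1, 0.2, 0.15, 0.05], path_params, benign_params))  # near 1
print(naive_bayes_posterior([0.9, 1.0, 0.95, 1.05], path_params, benign_params))  # near 0
```

The appeal of naive Bayes here is that each assay contributes an independent likelihood ratio, so assays that individually separate the classes imperfectly can still combine into a confident call.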
So we thought that was pretty powerful. PTEN, though, is a gene that highlights some of the real challenges with the approach that I've just articulated for you. The thing you'll notice about these functional score distributions for PTEN is that there are no blue variants: essentially no benign variants have been clinically classified in this gene. There are a few higher-frequency control variants that you could use, but even those are very limited in number. And what that means is that this framework I told you about, which ClinGen has articulated and which is so nice for calculating evidence strength, totally breaks down, and you can't really do much with it. So that's something we can talk about more as a challenge moving forward, a challenge, I would argue, for all of clinical genetics. But we were lucky in that a variant curation expert panel, a group of folks who really think hard about PTEN and how to classify PTEN variants, had already looked at this functional data and decided how it should be used for variant interpretation, so we were able to follow their rules and classify about 15% of the variants of uncertain significance for PTEN. Yeah. Why are there so few benign variants, even at silent positions? Yeah, well, we only looked at missense variants here. But PTEN is just incredibly intolerant to variation, and my impression, and I'll speak not as a clinician, is that clinicians are also a little bit hesitant to call benign variation in PTEN. What I've learned is that there's a lot of heterogeneity across different genes in terms of a clinician's propensity or comfort level with calling something benign.
And so for PTEN there just aren't clinically classified benign variants, apart from the couple that are there. But that's a good place to pause and take questions, because I'm kind of done with this section. Do people have other questions about this? Yeah. So it's interesting: for BRCA1 you have a very clear cut between functional and non-functional variants but only about 50% reclassification, whereas for TP53 you have ambiguity but a higher percentage of reclassification. How do you explain that? So the question, which I'll repeat for the people online, was just that in the case of BRCA1 the functional assay seems to separate benign and pathogenic variants basically perfectly, whereas for TP53 it's less perfect; yet for BRCA1 we were able to reclassify 50% of variants, but for TP53 70%. So why the disconnect? And the answer is that, in addition to the functional data that we're looking at here, we're also looking at all the other evidence that's available for each variant, right, and that comes from the family history, it comes from whether the variant occurs at a position where other pathogenic variants occur; there's a whole laundry list of information that's used for reclassification. And the fact of the matter is that, for the patients in this database, many more of the TP53 variants were already closer to being pathogenic or closer to being benign, and thus could be pushed in one direction or the other by adding the functional evidence, than for BRCA1. It's really parochial to each gene, in terms of what evidence was already available for patients. Does that make sense? Other questions? Yes, one online. Oh great, thank you. This is from Gary: can the effects of two or more variants in one allele in a patient be evaluated as well?
That is a fantastic question and something we're thinking hard about. The question really reduces to: can these assays model genetic context, right, beyond just a single variant? And I think the answer is that, at a single locus, that's possible right now. It's possible to envision, for example, combining the set of all possible variants with each of a few common variants that are found at that locus. It's also possible to think about combining the set of all possible variants at a locus with maybe one or two or a handful of, say, gene deletions. The group is working hard, and others at the University of Washington are working really hard, to give a more satisfying experimental answer to that question by being able to model variants in diverse genetic contexts. One idea for doing that would be to take a variant library and examine it in a whole bunch of different patient iPSC-derived differentiated cell types; that would be one example. Another would be to combine a variant library like the ones I've talked about with a CRISPR library that looks at maybe hundreds of different gene deletions or overexpressions. So the answer is that right now we're a little bit limited, but I think this is an important and active area of technology development. Okay, great. So, the other thing that I wanted to say about using the data in the clinic is just an additional little story about PTEN on the somatic side. I mentioned that there were two PTEN data sets, a phosphatase activity data set and an abundance data set, and you can see how all the variants lay out between those two assays here. We were able to use these data to classify every variant as causing either a total loss of both abundance and activity, a loss of one of those two features, or as reference-like.
And we could start to look at how each of these different classes of variants distributed over different populations of individuals: unaffected folks, people in ClinVar, autism spectrum disorder, or the PTEN hamartoma tumor syndrome, which is the classic syndrome that you have if you carry a germline pathogenic PTEN variant. And what you see is that the mechanism of a variant's impact, whether it causes loss of abundance or loss of activity, is not distributed equally across these populations, and it begins to give us the sense that having mechanistic data, like these assays can provide, is useful in thinking about how variants exert their effects and cause disease. So we did an analysis of a bunch of variants that we found in cancer. What you can see here is just a plot of the frequency of variants of the different types from different cancers. The dark green line is the null model of how frequently you would expect to find variants in each class, wild-type-like, loss of abundance, and so on, in cancer if there were no selection at all; in other words, the probability of generating each type of mutation given the gene and a simple model of mutagenesis. And what you can see here is clearly the effects of selection: there are far fewer wild-type-like and loss-of-abundance-only variants than you would expect in cancer, but far more variants that wipe out both functions, or that are loss of activity. That was really interesting, but it also led us to dig a little bit deeper. It turns out that in PTEN there are loss-of-activity variants that are known to act in a dominant negative fashion; PTEN can function as a dimer. And that implied to us that maybe some of these other loss-of-activity variants that we were able to identify could also be dominant negative variants.
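The null model described here, i.e., how often each functional class should show up in tumors if there were no selection, can be sketched as follows. The variant names, class labels, and mutability weights below are made up for illustration; a real analysis would use something like a trinucleotide-context mutational model over all possible variants in the gene:

```python
from collections import Counter

# Toy inputs (illustrative only): each variant's functional class from the
# multiplexed assays, and a relative mutability from a simple mutational
# model (e.g. CpG transitions weighted higher than other changes).
variant_class = {"R130G": "loss_both", "A126T": "loss_activity",
                 "K6E": "wildtype_like", "C136Y": "loss_abundance"}
mutability = {"R130G": 3.0, "A126T": 1.0, "K6E": 1.0, "C136Y": 0.5}

# Null expectation: if tumors sampled variants in proportion to mutability
# alone (no selection), each class's frequency is its share of total mutability.
total = sum(mutability.values())
expected_freq = Counter()
for variant, cls in variant_class.items():
    expected_freq[cls] += mutability[variant] / total

print(dict(expected_freq))
```

Comparing these expected class frequencies to the observed frequencies in tumors is what reveals selection: classes observed more often than the null expects are being selected for, and classes observed less often are being selected against.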
And we were able to follow that up. This is just a couple of Western blots, using a phospho-AKT readout, since PTEN dephosphorylates AKT, and you can see that some of the loss-of-activity variants act in a dominant negative fashion, just as we hypothesized. On the right-hand side there, you can see that they produce way higher levels of phospho-AKT than you'd expect, because they're dominant negative. So again, I'm showing this data to try to give you a sense of how large-scale functional data can point you in the right direction mechanistically, in addition to just saying, okay, this is a pathogenic and this is a benign variant. All right. So, hopefully that section of the talk convinced you that these data that are being generated have a lot of utility in the clinic. What we're thinking about now is: what are the missing technologies that we need in order to go from where we are now, with data sets that cover a few hundred genes and represent a small percentage of the genes where the data would be useful, to maybe having data for all the clinical genes in the genome, and maybe for more than one assay, and maybe for all the regulatory regions as well? And I would argue that there are four main areas to think about. One is scale: we need to be able to generate more of this data more cheaply. Another is context. I told you about tumor suppressors; these tumor suppressors are involved in DNA repair, which is a critical function for every cell, so context is maybe less important there. But there are plenty of other genes that encode, for example, proteins that are only expressed in certain cell types, like say cardiomyocytes, and if you were to try to do an assay on them in a generic human cell line, you wouldn't get much that's useful, because the structures that are present in a cardiomyocyte are not present in that cell line.
There's also environmental context, and we talked about the importance of genetic context and the need for technology there. There are of course many other types of variants besides single nucleotide variants that we would like to model: large and small deletions, insertions, translocations, and so on and so forth. And the last problem is that the functional genome is diverse, right: promoters, enhancers, genes of different types, and they're not all amenable to the same types of assays and technologies. So I'm going to use the last 15 minutes or so of the talk to tell you about trying to extend functional assays to additional functional elements, in particular genes that encode secreted proteins. So, a lot of the assays that we've developed so far focus on genes that encode cytoplasmic proteins, and all the data that I showed you previously are from genes of that type. And if you think about trying to use one of these assays on a secreted protein, you can see that it wouldn't work, right, because if you have your cell and it's encoding some secreted protein, what's going to happen is that that protein will be secreted and lost to the medium, and there's no way to do any kind of assay that recovers the sequence of the DNA that encoded that particular secreted protein. And this is a significant problem: about 10% of the genes in the genome encode a secreted protein, and about half of those have been linked with disease in some way, so it's not inconsiderable. There are other good reasons to think about secreted proteins, and fun ones. The one that I'm currently most excited about is that cell and gene therapies are taking off, and many of them deal with secreted proteins.
And so being able to better understand and engineer these proteins is also a compelling reason to try to study them, but I'm not going to talk too much about that today. So, we developed a method that we call MultiSTEP, for multiplexed surface tethering of extracellular proteins, to be able to assay variants in secreted proteins at scale. If you're familiar with protein display technologies, this is not that different, in the sense that we have a cell that expresses a secreted protein, but that secreted protein is fused to a linker, an epitope tag that will become important later, and a transmembrane domain that anchors the secreted protein on the surface of the cell. So now each cell displays the protein variant that it encodes, and we've reforged the genotype-phenotype link, which is the key for these types of approaches. So why did we bother developing another version of protein display? Well, most existing display technologies are, ironically, focused on displaying intracellular proteins, and they mostly work in yeast and bacteria, and those organisms, while great, don't have the post-translational processing machinery that's required for proper maturation of most human secreted proteins. That includes post-translational modifications like glycosylation and phosphorylation, but also protease cleavage and the like. So we built this system to basically deal with all of those issues, and we chose to apply it in the context of hemophilia B and the gene for factor IX. As many of you probably know, hemophilia B is a coagulation disorder, and it's caused by variation in the factor IX gene. It was interesting to us because it has a fairly high de novo rate, it's X-linked, and there's a fairly large patient population that has been really deeply phenotyped.
So there's a lot of clinical data about these individuals, which we were interested in being able to compare the results of our functional assays to. Additionally, variants in factor IX are most often missense variants, and we already know quite a bit about the mechanisms by which variants can cause disease. We know that some factor IX variants cause a defect in secretion, others cause a defect in post-translational modification, and a few others cause loss of either enzymatic activity, since factor IX is a serine protease, or of binding to critical partner proteins and the like. And so we were really interested in seeing if we could use our method to develop a mechanistic understanding of how variation in factor IX causes disease. And of course the big one, which is true for many genes, is that the vast majority of variants in factor IX are functionally uncharacterized. We don't have data about them, and many of the variants in this gene are variants of uncertain significance. Okay. Well, everything came through on the slides except this: the shading here didn't come through, so apologies for that, but hopefully you can still see these curves. What you see here is just a density plot of cells that have displayed either factor IX, in blue, or nothing, in red, stained with an antibody to that epitope tag that I told you about in our display system. What you see is that the cells that express factor IX have a nice high signal in that epitope tag channel. Additionally, there are several other antibodies that have been raised against different parts of factor IX, and each one of those antibodies gave us a nice strong signal, which told us that we had factor IX displayed on the surface. And we know from these data that factor IX is folded, because several of these antibodies are conformation-specific antibodies that have been very well characterized.
And so this gave us confidence that our system was capable of displaying full-length, intact protein on the surface, which is what we'd hoped. From here, it's just a hop, skip, and a jump to a multiplexed assay, right, because we can take a library of factor IX missense variants, which is what you see in blue here; we have about 10,000 variants in this library. We can stain that library with one of the antibodies I just showed you, and we can sort that library into bins, ranging from low antibody binding over here on the left to high antibody binding on the right, corresponding to little factor IX on the surface up to lots of factor IX on the surface. Then we can take those sorted cells, deeply sequence each of the bins, and compute a secretion score that reflects essentially each variant's capacity to be secreted from cells. And so this is a secretion map for factor IX. It's exactly analogous to the map that I told you about earlier, and here blue means variants that are not secreted well from cells. There's a lot of interesting stuff to see in this map, and I'll just draw your attention to one thing, which is the blue stripe that is boxed in red. That corresponds to the fact that lots and lots of cysteine variants in this protein are deleterious to its secretion. In fact, not only are mutations to cysteine deleterious, but mutations away from the many cysteine residues in this protein are also really deleterious to its secretion. And this is unusual. If you look at the effects of cysteine variation on factor IX, and that's just an average on the right-hand side there, you can see that, generally speaking, making mutations at cysteine residues wipes out the abundance of this protein.
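One common way to turn per-bin read counts from a FACS sort into a single score per variant is a read-weighted average of bin positions. I'm assuming a scheme of that general kind here; the bin weights and function are illustrative, not the study's exact scoring:

```python
def secretion_score(bin_counts, bin_weights=(0.25, 0.5, 0.75, 1.0)):
    """Read-weighted average bin position for one variant in a sort-seq
    experiment. bin_counts holds this variant's sequencing reads in each
    sorted bin, ordered from lowest to highest antibody signal. The bin
    weights are an illustrative choice, not the study's exact scheme."""
    total = sum(bin_counts)
    return sum(count * weight for count, weight in zip(bin_counts, bin_weights)) / total

# A well-secreted variant's reads pile up in the high-signal bins...
print(secretion_score([10, 20, 300, 700]))   # toward the top of the range
# ...while a secretion-defective variant's reads pile up in the low bins.
print(secretion_score([800, 150, 40, 10]))   # toward the bottom of the range
```

In practice the per-bin counts would first be normalized by each bin's sequencing depth and by the fraction of cells sorted into that bin, but the weighted average is the basic idea.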
And that's not true for many of the other plasma proteins that we've studied using similar assays of variant abundance. We think this makes sense because factor IX is a protein that's held together by many disulfide bonds. And so the thing that I think was not super surprising was that wiping out one of the cysteine residues involved in a disulfide bond messes up the folding of the protein and thus its secretion. What's interesting to us, and what we're trying to run down now, is that introducing a new cysteine residue somewhere in the protein also seems to mess up folding and thus secretion. And so we think there's an interesting story there that likely relates to the thermodynamic stability of factor IX and how it folds, and we're working on trying to really understand that, but it's a cool thing that came out of the data. Okay, so I mentioned post-translational modifications, and I should say that nobody's been able to do a multiplexed assay of variant effects on a post-translational modification, but we thought we were in a pretty good position to do one, because factor IX is extensively post-translationally modified. What we're seeing here is just how it's glycosylated in many places and cleaved by a couple of different proteases. Additionally, on its N terminus, it has this very interesting post-translational modification where it's carboxylated on a set of glutamic acid residues that, like I said, reside in the N terminus of the protein, in this so-called Gla domain. And these glutamic acid residues are critical for the function of factor IX: they mediate binding to a bunch of different partner proteins and to the phospholipid membrane. And we know, because there are pathogenic variants at some of these glutamic acid residues, that if you're not properly carboxylated then you're not functional. Okay. 
So, here's a structure of this Gla domain, and you can see these calcium ions here. The purpose of all these glutamic acid residues that get post-translationally modified, which are shown as sticks, is to bind these calcium ions and kind of hold this domain of the protein together and make it fold. There are also some post-translationally modified residues, which you can see, that bind magnesium, the gold spheres, and the function of those residues is less clear. At the time we started this work, we didn't know which of these residues was functionally important and which ones weren't. In addition to the antibodies that I told you about that just bind to different parts of the protein, there are also two carboxylation-sensitive antibodies available for factor IX. And you can see the variant effect maps for those two carboxylation-sensitive antibodies on the top, and then the epitope tag data that just tells you about secretion. It's kind of a complicated story, but what we learned was that this top antibody really sees the overall fold of the Gla domain, and the bottom antibody sees specific post-translational modifications, specific carboxylation events, within this domain. And you can see, in blue on the structure here, the residues where variation causes a loss of binding to this carboxylation fold-sensitive antibody. What we were able to learn by analyzing these data is basically which of these residues are important for the fold and function of this part of the protein, and which are not so important. And it turns out that most of the residues coordinating calcium are really important. 
And most of the magnesium-coordinating residues are not so important, much less important. And so that's sort of a cool biochemical story. We were also able to compare the data to, like I said, this wealth of patient data that we had for hemophilia B patients. So this is a plot of the secretion score in our assay on the x axis, and levels of factor IX in patient plasma, this is called the antigen level, but it's basically just how much factor IX the patient expresses, on the y axis. What you can see is that the correlation is quite good, and in particular, if you have a variant that scores low in our secretion assay, you're almost guaranteed to find that variant at low levels in patients. Conversely, if you have a variant that scores highly in our assay, it's really probable that it will also be found at high levels in patients, but there's much more variability there. What's kind of neat is that these variants down here, which are ones that are secreted well in our assay but are not found at high levels in patients, are actually variants that occur mostly on the surface loops of the protein. And what we think is going on is that these variants are sensitizing factor IX to cleavage by proteases in the plasma: they are secreted well in our assay, but they are present at low levels in patients. And so what we're working on now is a refinement of our assay to basically treat cells with plasma proteases and discover which variants predispose to plasma proteolysis. Additionally, and lastly, this is I think the last data slide I have. 
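The comparison just described, assay scores against patient plasma antigen levels, is essentially a rank correlation between two paired lists. As a sketch, here is a small self-contained Spearman correlation; the paired values are invented for illustration (the talk does not give the underlying numbers), and real analyses would use a library routine such as `scipy.stats.spearmanr`, which also handles ties.

```python
# Hypothetical paired data: multiplexed secretion scores (0-1) for a few
# variants, and plasma factor IX antigen levels (% of normal) for patients
# carrying those same variants. Values are made up for illustration only.

def rank(values):
    """Ranks of values (0 = smallest). Assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

secretion_scores = [0.15, 0.42, 0.68, 0.91, 0.33]
antigen_levels = [8.0, 35.0, 72.0, 95.0, 22.0]
rho = spearman(secretion_scores, antigen_levels)  # perfectly concordant here
```

The asymmetry the speaker describes (low assay score almost always means low antigen, but high assay score only usually means high antigen) is exactly what a good-but-imperfect rank correlation with one-sided outliers looks like.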
Hemophilia B patients are classified into three different classes of severity, ranging from severe to mild, and our data actually pretty cleanly, not perfectly but pretty cleanly, cleave the population of patients into these categories of severity. Said another way, if you score really low in our secretion assay, there's a very high probability that you have severe disease. And conversely, if you score fairly well in our assay, if you're a well-secreted variant, there's a much lower probability that you have severe disease. And so this is important for the hemophilia B community, because you can imagine situations where it's useful to know which variants will produce different levels of severity of the disease. Okay, I lied, there's two more slides. So the last thing that we did was basically ask, like we did for TP53, if you remember 20 minutes ago, can we take all the data that we've generated? We used five different antibodies, so 50,000 variant effect scores, five for each of the 10,000 variants in factor IX. Can we tie that data together, again using machine learning, to make a predictor of loss of function or normal function for each variant? Trained on clinical variants of known effect, can we then effectively predict variants that we haven't seen? And the answer is, yep, we can get an exquisitely specific prediction that's only partially sensitive. You can see the results on our test set there, where we've shown the predictions and also the feature values, but the bottom line is that this model learns that if you score poorly in any of our assays, you should be called pathogenic. 
If you score well in all of our assays, you're probably benign, but we're less certain about that, and that makes sense if you think about what we measured, right, because we don't capture every possible function of factor IX in these assays. So there are ways you can be pathogenic that don't correspond to either secretion or post-translational modification, but if you can't be secreted, or you can't be post-translationally modified correctly, then you're pretty much guaranteed to be pathogenic. And so we were able to repeat the same exercise that we started with for the factor IX gene, and that is to try to reclassify variants of uncertain significance. This is preliminary, all of this is unpublished, so these numbers aren't totally set in stone, but the bottom line is that we think we can reclassify about 50% of the variants in factor IX using this data. Okay, so then, lastly, we've shown that this MultiSTEP technology can be applied to many different genes encoding secreted proteins. We think there are good reasons to study variants in many of these genes, and that's sort of where we're headed: we want to expand MultiSTEP to apply it to many more proteins, and we think we can use it to measure other functional features of variants, including how they function and how they bind to partner proteins. And then, like I said, I'm really excited to use this data to try to improve recombinant protein, gene therapy, and cell therapy products that depend on secreted proteins. And I tried to convince you a little bit that we can learn about the mechanism by which variants exert their effects in hemophilia B, that there are some important clinical phenotypes that correlate with the molecular mechanisms that we've measured, and that those may be useful to clinicians trying to treat patients in this population. 
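The asymmetric decision rule the model learned (fail any assay → likely pathogenic; pass all assays → probably benign, with less certainty) can be written down as a toy stand-in. This is not the actual trained classifier; the assay names and the threshold below are hypothetical, and the real predictor is a machine learning model trained on clinical variants.

```python
# Toy rule mirroring what the trained model reportedly learned. Assay names
# and the 0.4 cutoff on a 0-1 functional score are invented for illustration.

ASSAYS = ["epitope_tag", "fold_ab_1", "fold_ab_2", "gla_fold", "gla_carbox"]
LOSS_THRESHOLD = 0.4  # hypothetical cutoff

def predict(scores):
    """scores: dict mapping assay name -> functional score in [0, 1]."""
    if any(scores[a] < LOSS_THRESHOLD for a in ASSAYS):
        # Failing any single assay is strong evidence of loss of function:
        # a variant that can't be secreted or carboxylated can't work.
        return "loss_of_function"
    # Passing every assay is weaker evidence of normal function, because
    # these assays don't capture every way factor IX can fail (e.g. loss
    # of protease activity with normal secretion).
    return "probably_normal"

passing = predict({a: 0.9 for a in ASSAYS})
failing = predict({a: (0.1 if a == "gla_carbox" else 0.9) for a in ASSAYS})
```

This structure is exactly why the prediction is highly specific (a low score is a reliable pathogenic call) but only partially sensitive (a clean sweep of high scores can still hide a pathogenic mechanism the assays don't measure).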
Okay, so just to wrap up, the vision I have is that we should generate functional data for essentially every variant in at least every disease-related gene in the human genome. We have the technology to do a lot of that, but not all of it, as I tried to convince you earlier. And I think we should consider generating such functional data not just for one phenotype, like say protein secretion, but in fact for many layers of phenotype, ranging from the molecular, how do variants impact RNA and proteins, to the cellular, what happens to cells, and also the organismal level, what happens in development and in tissues. And so we're working on building a full stack of technologies to span this range of phenotypes and get us to the scale that we need to accomplish this goal. I also want to say that this kind of effort already is, and will need to continue to be, a community effort. So, on our part, we've developed a database, MaveDB, that houses a lot of the variant functional data that's been generated; there are about 300 datasets in MaveDB, so about 3 million variant effect measurements. There's obviously much more work to do there. We're federating MaveDB, connecting it to ClinVar, UniProt, gnomAD, and a bunch of other community resources. That's, I think, really important work, but perhaps more importantly, generating all this data will take a lot of time, and also a lot of people in many different labs with many different types of expertise. And so we founded, along with a couple of other folks, this organization called the Atlas of Variant Effects Alliance, which has about 400 people in it, spans 25 countries, and is a group of everybody from clinicians to tech dev people to new researchers to industry folks. 
It's focused on realizing this Atlas, this sort of complete set of variant effect measurements for the human genome. And then NHGRI has spun up the Impact of Genomic Variation on Function (IGVF) consortium, which has 24 awards and a bunch of people, and is also at least interested in trying to understand variant effects at scale in the human genome. And so it's a really exciting time, as I said, because I think we have a lot of the tools to make a first draft of a comprehensive set of variant functional measurements, and we've done enough to show that that type of data will be a really, really useful resource, both in the clinic and on the basic science side. So with that, I'll just thank the people who did the work; they are many. As I said before, the clinical variant reinterpretation work that I showed you is a collaboration between my lab, Lea Starita's lab, and Ambry Genetics. Shawn Fayer, a student who was a genetic counselor before and is now a PhD student in Lea's and my labs, did that work, and he had some help from a few folks, including Abby McEwen. And then the factor IX protein display work is the brainchild of Nick Popp, an MD-PhD student, right there. He just defended his thesis and is going back to medical school, so I'm sad to lose him, but he did amazing stuff. Rachel Powell, Raining Wang, and a few others in the lab really supported his work. So with that, I'll say thanks especially to NHGRI, but also other institutes, for funding a lot of this work, particularly the technology development work, which is hard to get funded otherwise, and thanks to all of you for your attention. Thanks for an excellent talk. And do we have any questions? So, for the factor IX work, were you surprised at all that you weren't seeing clustering toward the signal peptide that sends it out to the surface? 
And do you know that it's not being secreted, versus the stability of the protein once it's outside the cell? Yeah, those are both good questions. Let me take the second one first: do we know that it's stability? In other words, can we differentiate between folded and present on the surface versus unfolded and not present on the surface? We think we can, because, as I mentioned, we have this panel of antibodies. Some of them see linear epitopes and some of them see folded epitopes, and they basically give the same answer. So we think what that means is that if you're unfolded, you are not secreted, and if you get out, you're folded. Oh, like in the plasma? It could be, yeah, but we don't see that, or rather we don't have sensitivity to that in our assay, because none of that plasma machinery is there; these cells are actually cultured in a serum-free medium, so there's not even a little bit of protease. That's what explains some of the difference between patient plasma levels and what we see in our assay, but like I explained, I think there are good prospects for assaying that if it's something we want to do. And then your second question, just give me one word to remind me. The signal peptide. Oh yeah. So the signal peptide is an interesting beast. We are using the native factor IX signal peptide, and we do see that it's more mutationally tolerant than other parts of the protein. The critical functional regions of the signal peptide, we do see, are sensitive to mutagenesis, and regions where there are no known signal peptide motifs are pretty much not sensitive to mutations. 
So I guess this is the first mutational scan of a signal peptide; we weren't exactly sure what we would see, but it turns out it's pretty mutationally tolerant, and, said another way, that jibes with what we know from alignments of signal peptides. Rachel, who I mentioned, is doing a project right now where we're looking at I think 10 or so libraries of just signal peptides drawn from different proteins. And so we'll be interested to see if, by comparative analysis, we can learn more about exactly what makes a functional signal peptide, and build a model to more accurately annotate signal peptides. But that's what I have to say about that. That was a great talk; I love the work. Two questions, very unrelated. Right, I'm slightly bothered by the answer to an earlier question, where you talked about the interpretations that came with some of the variants, that classified them in a certain way, and a bit of cautiousness for one of the genes. I couldn't quite get my head around that, so: how often do you end up relying on diagnostic findings when you're comparing the results of your assay, and do you have any way to get a more agnostic approach to interpretation? Because I could imagine there is a lot of sociology involved in making some of these clinical calls, and that could interfere with how you interpret some of your studies, and it could vary from gene to gene. Yeah. No, this is a big problem, and it's problematic in all sorts of directions. There's biased missingness in ClinVar, so that's a problem. There's error in ClinVar, and that's not uniform over genes; that's a problem. And then there are lots of genes for which there are no control variants, and that's an even bigger problem. 
So, you know, we have strategies to try to deal with the first two of those, and that involves things like curating what's in ClinVar, not using every variant but only ones that have been recently classified by labs that we trust, and so on and so forth, that the community trusts, I should say. But really, your question is insightful and gets to a much bigger problem that is not parochial just to functional assays but in fact applies to all of clinical genetics, which is: are we stuck with just what's in ClinVar? And I think our thinking about that is that we would like to leverage what's in biobanks to try to begin to get larger truth sets to assess our assays, and I know lots of other people are thinking in that direction in lots of spaces, but that's the direction we would like to go. We also, to sanity check our results, often use some of the best computational variant effect predictors, because although they're not perfectly accurate, they generally get things right, so if we see big discordance, that's a sign that things are not going well for us. But in general, to generate expanded truth sets, what we are intending and working to do is to rely on biobanks; to really mine all of that data, you have to get your hands on the truth. Correct, right. But we realize that we probably won't. 
We may not lead the world in doing that; there are people whose whole research enterprise is focused on doing that, but I think that's where the solution to the problem has to come from, because my take is that there just isn't the throughput to go through the process that gets you into ClinVar for every variant; that can't be the solution. And so we're sort of in the position, like, I'm an outsider, I'm not in ClinGen, I'm a technology developer and data generator, and so we've got some ideas about how to do this, and we'll probably publish some research papers on it. But what we're really hoping is that the people who control the process, or at least give guidelines for the process, like the folks at ClinGen and other analogous organizations, articulate guidelines that are inclusive of data that's in biobanks. So, second question, really unrelated: is there any part of your research program looking at non-coding variants with these sorts of assays? We generally don't look at non-coding variants, and I know that maybe hurts your soul a bit, but I think the reason why, right, is that there's so much to do in coding regions of the genome. We'll get there, and hopefully we'll get there in five years. That's my hope: that we'll be able to say, yep, there are no more uncharacterized variants in coding regions of reportable genes, at least in the simple cases, like highly penetrant, not very variably expressive Mendelian genes, where there are clear clinical actions to take. 
And that's sort of, I guess, my stage one victory condition. There's a whole other side of this work that is conditioned on understanding common disease, where we think that non-coding variation is also very important, or is much more important, and there are great colleagues doing that work, but it's not where I focus my efforts. Okay, let me ask that slightly differently, maybe a little more insightful way to phrase the question. Do you think the work you're describing will evolve technologically over time and eventually yield the ability to use similar methods for understanding the non-coding variants in the genome? Or are we going to wait for a whole new generation of technological innovations, a whole new way of thinking? Thank you very much; it would be easy to run from that answer. So my personal opinion is that the current generation of technologies is going to struggle with non-coding variation, because context matters so much more for non-coding variation than it does for, especially, the kind of genes that I talked about in this talk. But consider the next generation of methods that we and others are working on. Just for example, I mentioned trying to port these multiplexed assays into iPS cells, where we can do cheap and large-scale experiments, but also, one of the things my group is working on is trying to do multiplexed assays in developing model organisms, where we can actually have real, different cell types interacting with each other. So if those technology development efforts bear fruit, then I think they will be useful for going after non-coding variation; the answer is sort of conditioned on the successes of these proximal efforts. But I don't think. 
I don't think it's impossible to see, in a few years, five years maybe, or maybe less, that we'll have some technologies where we have enough context, where we're in enough of the right contexts, that we can start answering questions about non-coding variants at scale. And that will be the business I'll be in in five years, because we'll have succeeded on the current problems, which would be great. I'm just wondering, regarding the factor IX story, are you thinking about trafficking at all, and how you could see the effects of mutations on trafficking? Yeah, that's a great question. We are thinking about it; one thought is to combine the library that we have with some CRISPR-based perturbations of the secretion machinery. There are also some molecules that modulate secretion and proteostasis that we've thought about using. So yeah, it's a direction that we're thinking about, and I actually think it's one of the key strengths of the specific approach that we use, that it can be layered with genetic perturbations at other loci pretty easily. It would be super cool. Maybe I'll come back and talk about it in a few years. All right, we've got a couple more questions on Zoom. Oh, great. First: your thoughts on how to classify variants that are pathogenic in one ethnicity but might not be in another? Yeah, so this is a really important question, and it connects to two things: one question that was already asked, and one thing that I want to say. The question that was already asked was about genetic context, right. And if the questioner is asking about a variant where there's context that basically flips the meaning of the variant, I think that the current generation of assays that we're doing will not succeed there, because they're essentially context free: they're in some weird human cell line, or maybe they're even a molecular assay. 
And that's one of the main reasons that we're driving at developing methods where we can look in multiple contexts, so that those contexts can capture differences across populations. So that's one piece of the answer. The other piece, which is important to say, is that one nice thing about functional data, relative to this issue of population-specific variants, is that we are testing all the variants: the data that are generated are not only for a set of variants that come from one population or another; the map that we produce has information for essentially all variants. So I guess that's a strength, but it is a weakness that we don't look at genetic context, and I guess what I would say is that for genes where variants like that are known, we should be very cautious. If we've committed an error there in what we've done already, I'd love for this questioner to get in touch with me so I can learn more and talk about it. Okay, and the second here is related: how long do you think it will be for us researchers to be able to shift from studying the impact of individual variants on a single, somewhat arbitrary reference gene sequence to the impact of the variants on the appropriate haplotype background upon which they occur in the patient? Basically getting at personalized medicine. Yeah, well, I'll start counting stars in the sky and we'll get there at about the same time. But no, I mean, I think it's obvious by inspection that we're not going to experimentally test every combination of variants that can occur, right, even pairwise. 
And so I think where we're headed is that we get enough saturation datasets, and then enough sparser datasets in enough different genetic contexts, that we can begin to make accurate inferences and don't have to do every experiment. As to how long that takes, I don't know. I could throw out a number: let's say in three or four years I hope, and I think, we'll have saturation libraries in at least a handful, say five to ten, key genetic contexts, like in an iPS-derived kind of cell experiment. From there, I think how good we are at inferring across contexts will tell us how long it will take to really have a satisfying answer. All right, and then the final one here. Oh, actually we just got another. All right: have you tried this approach with PD-L1, and could it predict sensitivity, or lack of sensitivity, to anti-PD-L1 immunotherapy? For anything like PD-L1, there's a really great resource built by my colleague in our NHGRI CEGS, Fritz Roth, called MaveRegistry. It's a resource where several hundred projects like this have been registered by the community, and about half of them are registered pre-publication. So the questioner should go to MaveRegistry and look if PD-L1 is in there; I wouldn't be surprised if it is, because I do think this approach could be useful for looking at resistance, in fact I'm sure that it would be. So they should check MaveRegistry, and if not, maybe they should do the experiment. Awesome. And then the final one here. This question kind of ties into Eric's previous question about non-coding variation, but how could we apply these methods to the epigenome? That's like a whole different ballgame. I guess what I would say is that where we have sequencing-based methods that can read out epigenetic marks, and where we have epigenetic mark writers that we can control, 
I think it is possible to do the kind of experiment I'm talking about here. It's fascinating; I haven't thought much about it, and the questioner should get in touch and have a Zoom call, because that sounds fascinating. I think in principle it's possible: again, if you have a writer, and you can sequence that mark, then I think it would be possible to do. So that's a great idea. Thank you so much; it's been a really insightful set of questions. Sorry, just over to one more. Okay, just one more quick one. So, really nice talk, I really enjoyed it. I just want to get a sense, in terms of profiling the secretome, are there any limitations with regard to size? Like, how small can you actually go? Because a lot of things, especially things like cytokines, can be quite small, versus much larger proteins; I think factor IX is actually quite large. So are there any technical limitations there with detection, and how much coverage do you think you're actually getting? I'm sure there are many technical limitations. We've applied this method to on the order of 10 proteins so far; the smallest one that we've tried is insulin, and it did work. That's pretty small and pretty highly processed, so that gives me some hope. But, you know, it's a method still in its infancy, so I'm sure there will be lots of limits. But thanks so much, I really appreciate it.