Okay, so first I'd just like to say thanks for the opportunity to share the work we've been doing in my lab. I'm going to talk a bit about collecting large-scale mutagenesis data, which happens on the wet-lab side, and then about some work we've been doing with that data to empower variant effect prediction. We just had a really lovely talk from Daniel highlighting how we can use population-level data to try to infer what a particular variant or region in a gene might be doing relative to a phenotype we care about, and in fact genetics in general is concerned with making inferences like this. That's great, except that, as we just saw, a lot of the variation we find in genomes is rare or even private, and in that case we're in trouble if we want to interpret a specific variant, because it's likely we've never seen that variant before. The scale of the problem, when we think about using genomes in a clinical setting across a large number of patients, is really extreme. If we assume a fairly reasonable per-base, per-generation mutation rate, then everybody carries something like 60 de novo variants, and across the whole population of the world there are something like 44 instances of every possible single-nucleotide variant. That assumes mutations are spread evenly across the genome, which is probably not true, but the point is that, to a first approximation, all the possible mutations are out there and will need to be dealt with somehow, and that's really challenging. So, as our SHROOM3 homework paper pointed out, a really effective way to do that can be to use a model system to interrogate the variant you're interested in.
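The back-of-envelope arithmetic behind those figures can be sketched as follows; the mutation rate and population size below are illustrative assumptions, not numbers from the talk, so the outputs are approximate:

```python
# Rough estimate of de novo variants per person and per-SNV recurrence
# in the population. All inputs are illustrative assumptions: a commonly
# cited per-base, per-generation mutation rate and a rounded population.
mu = 1.0e-8            # mutations per base per generation (assumed)
genome = 6.0e9         # diploid genome size in bases
pop = 7.0e9            # world population (rounded)
snvs = 3.0e9 * 3       # possible single-nucleotide variants (3 alternates per site)

de_novo_per_person = mu * genome            # ~60 de novo variants each
total_de_novo = de_novo_per_person * pop
instances_per_snv = total_de_novo / snvs    # assumes mutations spread evenly

print(round(de_novo_per_person))   # roughly 60
print(round(instances_per_snv))    # a few dozen instances of each SNV
```

With these assumed inputs the per-SNV count lands in the mid-forties, in line with the ballpark quoted in the talk.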
And so we've had some discussion, and I'm sure we'll have more, about what the right types of models are to use. But no matter what model system you're looking at, you run into a problem fairly quickly, which is that sequence space is really, really vast. A typical human protein of 350 amino acids has on the order of 7,000 possible amino acid exchanges. So if you go into the business of testing the variants you see in each sequenced genome one by one, you're going to be in that business for a very long time, probably far longer than the patients you're trying to help can wait for the information your test produces. So we've been working on a set of methods to make and test large numbers of variants, large enough, in fact, to test all the possible single mutations in a gene at once. We call this technology deep mutational scanning, and it's one of the many phenotyping-by-sequencing methods that have come about in the last several years. You start with a coding sequence of interest. You make a library of mutations that covers, for example, all possible single mutants of the protein you're interested in. You instantiate that library in a model system; our lab works primarily with mammalian cells with edited genomes, although we also work in yeast and other models. Once your library is set up, you impose a selection for the function of the protein you're interested in. This could be a single functional assay, or you could try to phenotype deeply at the molecular level, capturing many different assays for molecular phenotype. In any case, variants that have the property you selected for, namely high functional capacity, will enrich, and ones that don't will deplete. Then you can use deep sequencing to capture the frequency of each variant across the selection.
We've also developed software and statistical methods to calculate a functional score for each of the thousands or tens of thousands of variants in the library relative to the wild-type sequence. What you get out of a deep mutational scan is essentially something like this. This is a way to organize the data that we call a sequence-function map; this one happens to be for Src kinase, which is the dataset we've most recently generated. What you have here, for a little snippet of Src kinase from position 219 to position 250, is the functional consequence in this assay of all, or at least many, of the possible amino acid exchanges, which are the different rows. You can see that some positions are tolerant and other positions are sensitive, which is maybe what you would expect. We can do this now at the scale of whole, reasonably sized genes; this is most of Src kinase. It's these sequence-function maps that I think will be particularly helpful if we're thinking about scaling functional assays to assist in interpreting variants we find in genomes. In fact, tomorrow Lea Starita is going to talk about the work she's done making sequence-function maps for BRCA1 and then using those maps to train highly accurate models on variants of known effect and make accurate predictions for all the remaining variants. That's an exciting mode in which we're working, and it works really well if you have a high-value target like BRCA1 or other genes we really care about and can invest effort into right now. Although we're trying to develop technologies to scale this to hundreds or thousands of genes, that's not presently possible. So we're left with a challenge: we've got many disease-associated proteins, and we'd like to be able to learn about the functional consequences of mutations in those proteins.
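The core of that score calculation can be sketched very simply: compare each variant's enrichment across the selection to the wild type's. This is a toy version with hypothetical read counts; the actual Enrich2 software uses a more sophisticated regression-based estimator with error modeling:

```python
import math

def functional_score(pre_count, post_count, wt_pre, wt_post, pseudo=0.5):
    """Log2 enrichment ratio of a variant relative to wild type.

    Variants that enrich under selection score above 0, and variants
    that deplete score below 0. A small pseudocount guards against
    zero read counts. All counts here are hypothetical examples.
    """
    var_ratio = (post_count + pseudo) / (pre_count + pseudo)
    wt_ratio = (wt_post + pseudo) / (wt_pre + pseudo)
    return math.log2(var_ratio / wt_ratio)

# A variant that tracks the wild type scores near zero...
print(functional_score(1000, 2000, 5000, 10000))
# ...while a variant that depletes under selection scores well below zero.
print(functional_score(1000, 100, 5000, 10000))
```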
And that brings me to what I mostly wanted to tell you about today, which is prediction of variant effect. These are just some variant effect predictors; I'm sure you're familiar with most, if not all, of them. Particularly in this crowd, they get pilloried for not being very accurate. They also get used in most papers that I read. So it's a bit of a love-hate relationship, but to me that represents an opportunity: something everybody uses but nobody really loves. We wanted to move in what we thought would be two useful directions. One is that many of these existing predictors either explicitly or implicitly seek to predict the effect of a variant on a human phenotype. We thought that a more narrowly scoped task, namely predicting the effect of a variant on a protein's function, might be, A, something we could be more successful at, and B, something that would leverage what we're learning from projects like ExAC and many others, including genome-wide gene knockouts in model systems, about what happens when we lose function in a particular protein or domain relative to a phenotype we care about. Also, as was noted earlier in the meeting, nobody really knows what to do with activity-enhancing variants, and we thought we might be able to make some headway there. The reason we thought we could attack this problem is that we're sitting on a trove of large-scale mutagenesis data, deep mutational scanning data generated by our lab and others. This figure is a bit out of date; it's more like 60,000 mutations now, single mutants, that is. And these data are deep: we've profiled the effects of most mutations at every position in these proteins. So we thought that from these data we might be able to learn better rules for how mutations impact proteins.
That being said, the dataset is not perfect. For this crowd, anyway, what we might like is a set of human proteins, perhaps the most disease-associated human proteins. What we actually have is a set of proteins coming from a variety of different organisms and assayed in different labs. Nevertheless, we annotated our training data, as well as all possible mutations for every protein in UniProt, with a set of descriptive features that our model uses to make predictions. These fall into three broad categories: physicochemical features of the wild-type amino acid; structural features, if we have a structure; and biological features, principally pertaining to site-specific conservation at the position where the mutation occurred. We used a decision-tree-based machine learning algorithm to train a global regression model, meaning we're trying to predict the actual magnitude of a mutation's effect relative to wild type. This is the performance of the model under cross-validation; I'm showing it to you just to convince you that the features we've chosen and the model training procedure we employed are sensible. We can actually capture most of the variance in the training dataset with our procedure. You might be interested in which features are the most informative, and it turns out that site-specific conservation and structural information, namely solvent accessibility, are the most important. That's not surprising; it's what many others have found. With those sanity checks out of the way, what you really want to know is how the model performs on a dataset it hasn't seen at all before. Here I'm showing you one example of leaving one dataset out for this regression model and training on the remaining six. I should mention that I'm talking about a version of this effort that uses seven proteins.
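A minimal sketch of this kind of training setup, using scikit-learn's gradient-boosted trees on synthetic data, looks like this. The three feature columns are invented stand-ins for the physicochemical, structural, and conservation feature classes, and the synthetic target deliberately weights the conservation and structural columns most heavily, mirroring the feature-importance result described above; none of this is the actual Envision training code:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
# Columns are synthetic stand-ins for: physicochemical features,
# structural features (e.g. solvent accessibility), and conservation.
X = rng.normal(size=(n, 3))
# Invented ground truth: conservation and structure dominate the effect.
y = 0.1 * X[:, 0] + 0.3 * X[:, 1] + 0.6 * X[:, 2] \
    + rng.normal(scale=0.2, size=n)

# A decision-tree-based global regression model, evaluated under
# cross-validation, as in the talk's sanity check.
model = GradientBoostingRegressor(random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())  # cross-validated R^2 on the synthetic data
```

The real leave-one-protein-out evaluation would hold out every variant of one protein, rather than a random fold, which is a much harder generalization test.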
Currently we're working with about 25, but seven is what we have finished and processed at this point. In any case, we get a regression r-squared of about 0.6, and how I feel about this result depends on whether I wake up in a good mood or a bad mood. I think that learning, from six proteins, a rule set that can predict the functional consequences of mutations in a seventh with an r-squared of 0.6 is pretty good; that's exciting to me. And we know from simulations that as we add more datasets, we're going to get better and better at this task. But as a biochemist, I'm not going to convince anyone to put down their pipettes just yet with this data, and that's OK; I think we're on a good path. Nevertheless, we wanted to retreat to what we thought would be a slightly easier task, and that's predicting the categorical effect of a mutation: either damaging or not. What's nice about deep mutational scanning datasets is that in addition to the non-synonymous effects, whose distribution I'm showing you on the left-hand side there, you also get the synonymous effects essentially for free, and as you would expect, that distribution is much narrower. So we can use it to define an interval within which we expect mutations to be damaging, and another interval within which we expect them to be wild-type-like or function-enhancing. We can use this procedure to discretize all of our data, and, lo and behold, most mutations are damaging. That's reasonable. We can then set about training classification models. I'm showing you here the performance of a classification model trained leaving the indicated dataset out, with area under the ROC curve in yellow and model accuracy in green. We did that for all seven of the proteins in the dataset, and this is what we get: overall, reasonable accuracy, though some of the datasets are not predicted as well as others. There are a couple of potential reasons for this.
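The discretization step described above can be sketched like this: because synonymous variants should behave like wild type, their score distribution defines a wild-type-like interval, and scores falling below it are called damaging. The quantile cutoffs and the simulated score distributions here are illustrative assumptions, not the cutoffs used in the actual analysis:

```python
import numpy as np

def discretize(scores, synonymous_scores, lo_q=0.025, hi_q=0.975):
    """Label variant scores using the synonymous-score distribution.

    Scores below the synonymous interval are labeled damaging; scores
    within or above it are labeled wild-type-like or function-enhancing.
    The quantile cutoffs are illustrative, not the published thresholds.
    """
    lo, hi = np.quantile(synonymous_scores, [lo_q, hi_q])
    return np.where(scores < lo, "damaging", "wt-like-or-enhancing")

rng = np.random.default_rng(1)
syn = rng.normal(0.0, 0.1, size=200)       # narrow synonymous distribution
nonsyn = rng.normal(-0.8, 0.6, size=1000)  # broader non-synonymous distribution
labels = discretize(nonsyn, syn)
print((labels == "damaging").mean())  # most simulated mutations come out damaging
```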
One is that these proteins, when left out, don't predict well because they're just so different from the other proteins; some of these proteins are enzymes, some are not, some are big, some are small. The other reason is that some of these datasets are among the very first collected using deep mutational scanning, and like any technology, it has matured and gotten better. So we're working to incorporate measures of dataset quality into our model training procedure. All of that is preamble to the real question: how well did we actually do at this task of predicting biochemical effects? This is showing you our best model, which we call Envision, compared to a few commonly used tools, and we do quite a bit better. This comparison is one that really favors us, because we're using deep mutational scanning data, so you might ask: have you just learned the structure of these datasets and nothing else? I can tell you that we've also used another mutational database, PMD, with which some of you might be familiar; it has about 100,000 mutations in it, curated from the literature over many, many years, so it's a very messy dataset. But what I can say is that when we compare Envision to any of these predictors, as well as some others, on the PMD dataset, we beat the other predictors, sometimes by a large margin, sometimes by a small margin, but in every case we do better. So we think we're on to something with Envision, and that's good. One other thing we tried was to look at pathogenicity predictions, and we weren't really sure what would happen here; of course, we haven't trained Envision to do this. This is an ROC curve comparing Envision to some other predictors on ClinVar data, and Envision doesn't do that well. Perhaps that's not surprising, because we envisioned Envision being paired with knowledge about what happens when you lose function in a protein.
That's how we envisioned it being used. The reason I'm showing you this data is that if you make a mixture model of Envision and the predictor that happened to perform best on this data in our hands, which was CADD, then that mixture model actually does a bit better than either predictor alone. So even with this very naive way of combining predictions of biochemical effect and predictions of pathogenicity, we improve things. We have ideas about how to do this in a much more nuanced way, which I'll touch on at the very end. I told you at the beginning that one of our main motivations was predicting function-enhancing mutations. If you're trained in biochemistry, this is an interesting question: can you even do this? We tend to think of function-enhancing mutations as idiosyncratic, tuning up a binding interface or something like that, so we weren't really sure what would happen. Nevertheless, we can reclassify our data, using the procedure I told you about before, as either function-enhancing or other, and we have about 1,000 function-enhancing mutations across the seven datasets. So we were able to train a model to predict function-enhancing mutations, and that actually works pretty well. When we thought about why this worked, what we realized was that we've likely been able to predict stabilizing mutations. That might be a general class of mutations that enhances at least the amount of protein around to perform whatever function we demanded in the selection, and we think that's what's going on. We've done some validation of that assertion, and it looks like it's true. We've also tested about 25 or so variants that we culled from the literature, gain-of-activity variants in proto-oncogenes as well as some engineered proteins, and we find that we have a statistically significant ability to find function-enhancing mutations.
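One naive version of that mixture model is simply a logistic regression over the two predictors' scores. Everything below is synthetic: the labels and the two score columns are invented stand-ins for an Envision-like biochemical score and a CADD-like pathogenicity score, used only to show the mechanics of combining two weakly informative predictors:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 400
y = rng.integers(0, 2, size=n)  # synthetic labels: 1 = pathogenic

# Two noisy synthetic scores, each only weakly informative on its own.
biochem = y + rng.normal(scale=1.5, size=n)  # stand-in for a biochemical-effect score
patho = y + rng.normal(scale=1.5, size=n)    # stand-in for a pathogenicity score

# A simple 'mixture model': logistic regression on both scores.
features = np.column_stack([biochem, patho])
mix = LogisticRegression().fit(features, y)
mix_scores = mix.predict_proba(features)[:, 1]

# The combined score typically outperforms either component alone.
print(roc_auc_score(y, biochem))
print(roc_auc_score(y, patho))
print(roc_auc_score(y, mix_scores))
```

A more nuanced combination might weight the two scores differently per gene, depending on how much is known about loss-of-function consequences for that gene.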
So we think this is a new and interesting thing that variant effect predictors have not been able to do to date. To conclude, I'll just say that I told you about two extremes at which large-scale mutagenesis data can be used. One, which I just teased and which Lea will talk about quite a bit tomorrow, is when we have a high-value target where we really want to know about the protein in question: we can make a sequence-function map and do all the experimental work that protein needs up front, giving us the opportunity to have that data in hand when we see a new mutation in a genome. Then I also talked about the other end of the spectrum, the fully global model that tries to learn, from this data, a better predictor of variant effect, and I'd argue we've done that a bit better than had been done before. What we'd really like to do is build a model that's comprehensive and takes data from across this spectrum, so that if you ask for a prediction for a protein that's already been scanned, you get a gene-specific prediction; if you ask for a prediction for a homologue, you get a model built on the homologous protein that is thus pretty highly accurate; and finally, if you ask for a prediction for a protein that hasn't been scanned at all, you get a general prediction. And of course, we'd also love to integrate what we can learn from approaches like the one Daniel talked about, to situate these predictions in the context of what we know about the likelihood that a mutation in a particular gene or region of a gene is actually pathogenic.
So with that, I'll close by thanking the people who did the work: in particular Ethan Ahler, a graduate student who collected the Src dataset, and Vanessa Gray, who spearheaded the Envision project that I spent most of my time on, in collaboration with Ronald Hause and Jens Luebeck in Jay Shendure's group. And finally, Alan Rubin and Terry Speed, who've been instrumental in helping us get the statistics just right for Enrich2 and the data analysis. So if you have questions, I'm happy to answer them. Great, so we'll take five minutes for specific questions, and then we'll open for discussion. Yes, Dan. So, Doug, thanks for that talk. Can you talk a little bit more about the assays and how you personalize them, in the sense that each gene must have a different function and so a different assay? Or are there general rules, or general assays, that apply to large numbers of genes? Yeah, that's a critical question. The work we've done up until now has used gene-specific assays, and that requires a fair bit of setup work; it's why you have to invest a fair amount of effort to produce one of these datasets. So, for example, the Src data that I showed you came from a Src-activity-specific assay; in fact, all of the datasets came from activity-specific assays. That works great, but it takes time. One of the things we're working hard on now is developing generalized assays for molecular and cellular phenotyping. There, we're trying to leverage, essentially, flow-sortable assays that we can deploy for any protein. Those rely on, for example, assessing the stability or turnover rate and the localization of a protein inside a cell, or assessing measures of general cellular health, like intracellular pH, mitochondrial stability, time spent in each of the phases of the cell cycle, or the expression levels of particular marker proteins.
What we'd like to do is build a panel of assays that we can use to construct, essentially, a matrix of variants by phenotypes, and then learn from that matrix which variants look like wild type and which ones don't, perhaps even picking out clusters of variants that differ from wild type in similar ways, so that we can begin to break down the sub-functions of the protein, or the regions that cause particular cellular or molecular phenotypes. That's how we're thinking about that problem, but it's one of the challenging pieces of doing this work. Other questions specific for Doug? Yes, Bob. So I want to make sure I understand. When we're talking about, at least, Mendelian disorders, we're usually talking about non-redundant genes, which implies they have something unique about them that's important. Does that imply that you're never going to get completely away from the protein-specific or pathway-specific end of the assays and just be able to use the global model? So, it was a little hard to hear you, and I'm not sure I fully understood your question, but we're really focused on molecular phenotype. I think there are questions that arise when you think about complex traits as opposed to Mendelian traits, and we've got some ideas and are beginning to work on making combinations of mutations at many loci in genomes to look at context dependence. But the type of assays I told you about here won't speak to that. They really are good when you have knowledge about what loss of function, or alteration of function, of a particular protein will do to a phenotype: given that information, you want to know, for a particular variant, what's happened to the protein in question. I don't know.
So the global model isn't really focused on the kinds of variants that will specifically mess up the unique function of a particular protein? I think I see what you're saying. The global assays, no; the idea behind them is to measure general characteristics of proteins. We've thought about ways to get at something that's both general in its method of measurement and specific to the activity of each protein, and what we've come up with so far is over-expressing the protein and looking at changes in cellular phenotype when it's over-expressed. But I don't think that's a really satisfying answer yet. I think we're stuck with the general phenotypes we can measure, and then crafting protein-specific assays when we feel it's warranted. So, last question for Douglas, specific: Kathleen. Yeah, hi. I was interested in, okay, it stopped again. That's good; it's not going to explode. The predictors of function-enhancing mutations, especially with respect to protein stabilization. Obviously your mind then automatically goes to therapeutics. So are you thinking about these protein stabilizations for therapeutic areas, and how much stabilization can you really get from these? Yeah, well, as I tried to represent on that slide, we haven't yet fully, I mean, it's more or less a hypothesis that what we're seeing is generally stabilization. We've made and tested a few variants, but I'm not going to hang my hat on saying for sure that's the only mechanism we're predicting or seeing there. It's worth noting that these large-scale mutagenesis data can themselves be very useful for finding stabilizing mutations in proteins and tweaking protein function; a whole other part of my lab works on the protein science side of things.
They are keenly interested in tuning enzyme activity and altering protein properties, whether the proteins are drugs or serve other functions, and given a dataset like this, we can predict a bunch of stabilizing mutations. As for the general predictions of activity enhancement, whether we'll be able to draw a standard curve that relates them to ΔΔG, I mean, if we could do that, that would be great, but I'm not sure it will turn out to be that finely grained. Yeah, I was just going to say, certainly those are cell-based assays that can be done relatively quickly, and from the perspective of protein therapeutics, that would be a great tool to have. Yeah, immunogenicity is another property we'd like to explore at large scale, to be able to engineer out immunogenic regions of protein drugs. But that's a different conversation for another conference. Great, we'll let Douglas move back to a seat. We'll open this up. I'm going to impose the rodent rule and ask the first question. And then,