All right. I hope I can convince you that there are sources of variability beyond the genetic background and the allelic structure itself. I was asked to give this presentation on something that I think we've all observed in the IMPC for a while now, at least anecdotally, and I believe we're beginning to appreciate that it's more common than we originally thought, or at least than I originally thought. The reason I think this is important to discuss is that, because of the structure of our pipeline, which I'll go into in a minute, and the scale at which we're performing this screen, we have a unique opportunity to explore the biological phenomenon of variable expressivity and incomplete penetrance beyond the sources of variability we've been discussing thus far.

As I mentioned, this begins with a fairly simple observation that, at least to my naive expectation, was different from what I expected. The notion is that we're carefully controlling a lot of our experimental variables within the IMPC and KOMP2 pipeline: genetic background, careful SOPs for our phenotyping assays, and environments that are as carefully structured as possible. What we've been observing is, at least to me, surprisingly frequent instances of variable expressivity and incomplete penetrance in our mutants. This is just one of the many examples we have. You can see this is a normal mouse embryo at E15.5, and you can probably appreciate that all four of these specimens are abnormal mouse embryos at E15.5. What I wanted to point out in particular is that they're abnormal in very different ways from each other, despite the uniformity of the gene, the allele, the genetic background, and the zygosity. And Steve's not here.
I made this slide specifically for Steve because we frequently debate what these two terms mean, so I thought I'd take a moment to define them. I think we all know what they are, but this way we're at least all on the same page. When we speak of incomplete or reduced penetrance, we're speaking about cases where individuals bearing a mutation do not display a specific trait. It's yes or no, a binary type of trait, and in our experiments it typically applies to categorical variables. When we put something in a bucket, it's either in the bucket or it's not. When we speak of variable expressivity, individuals bearing the mutation have a range of phenotypes, or a range of severity for a given phenotype. This can of course apply to the same traits that we describe with categorical variables, but I think the most important point here is that we lose that information by putting them into a categorical bucket. We know this, and we lose some of the nuance that comes with variable expressivity. In many ways, this is best described with quantitative, or what we call continuous-trait, approaches. Exploring both of these is something we're very interested in doing.

It's not limited to the mouse. I did a little background research on this, and it's no surprise that in human genetics there's a lot of variable expressivity. We've already heard about some of this today, but I was a little surprised by the degree of variable expressivity we see in monozygotic twins. This is taken from a review on the subject, and it's a series of different traits. I know you can't read this from here, but some of them are obviously complex, like intelligence, and some are much more discrete, like blood chemistry levels.
What you're seeing here in red is the degree of concordance between monozygotic twins, for males and females, which is, at least to my mind, lower than expected. In an area I know a little bit about, there's roughly 60% concordance for something considered a fairly simple and easy-to-assess trait, orofacial clefting. That's far lower than one would perhaps think, given the homogeneity of monozygotic twins.

So we've talked a lot about the different sources of variance. This includes allelic heterogeneity and/or compensation, which we heard about from Nico. I wanted to mention zygosity here; I don't know why it's bulleted under this item, that's a mistake. The zygosity of the allele makes a difference. If we think about the example Gary presented of the F1 crosses to multiple inbred strains, the obvious way those phenotypes could be modified was by modifying the expression of the existing copy of that gene. That is not going to be the case in any of our experiments. We're talking about other modifiers that compensate in very different ways. We heard a lot about genetic context, and we know that it clearly modifies phenotypes in many ways. There are experimental variables, the way in which we do the experiment, which is similar to the environment itself. Then of course there's the specter of epigenetic effects, and stochastic mechanisms that could be at play. What I wanted to note here is that in the IMPC, in our approach, we account for a lot of these in a very specific way, which focuses our thoughts down to these two areas here. And stochastic mechanisms are something I wanted to at least suggest are probably at play in a lot of our mutants.

So why do we care about this? I think this sort of random variance is typically considered noise: a problem in your experiments, something you may or may not be interested in, and certainly something you are going to try to mitigate.
But I think we have an opportunity here. We can explore this because of the way we do our experiments: we work very hard to control other sources of variability. I think our scale also allows us to do this on a genome-wide basis, so we're able to look at many different genes that display variable expressivity. As I'll describe, based on some of the existing models for why this might be occurring, I think this will provide some insight into the mechanism and architecture of robust networks, and particularly into how individual genes fit into these networks, in which case we can begin to think of variable expressivity as a trait that is part of the gene function we're looking at. I also think a deeper understanding of this will help us understand these pathways better. We've also heard a lot today about how to interpret variable expressivity in human genetics. If we can add some information about when this happens in a stochastic way, we can perhaps help sort out the difference between environmentally driven variability and fundamental variability that's due to the network structure.

So what are some of the models of variable expressivity? I think this is a really interesting exercise, because there is a decent body of literature around it. One model, which I'm not sure is particularly relevant here, comes from a lot of data suggesting that loss of global modulators of robustness leads to variability. These are chaperone-type genes and the like that provide robustness to, say, the hypothetical signaling network I've drawn in the schematic here, and buffer it against environmental or even genetic variation, allowing this to be a fairly stable pathway. So in this robust hypothetical network, a chaperone protein prevents excess variation in these different components.
When you lose that buffering, the network is a little more unstable, and you now have a sensitized phenotype. It's more sensitive to environmental perturbations or, in some cases, genetic perturbations. And when you lose components of this pathway, because the pathway is highly integrated, with multiple inputs and connections, you don't completely lose the pathway. You end up with a noisy pathway, particularly down here where, say, there are amplification loops at the tail end of a given robust network.

Another way in which this is thought to occur is that paralogs in the genome can provide some level of functional redundancy, albeit imperfectly. How would this work? This is our same hypothetical network. If you lose a key component here, then because of its position in the overall network, the activation of this effector molecule drops below the threshold at which it can amplify itself, and you have a severe mutant phenotype. However, if you had a paralog that fit into this network, perhaps incompletely, then rather than losing activation of the pathway completely, you end up with a more or less variable pathway, because the thresholds here are varying.

And finally, the networks themselves can provide a level of integration such that losing a component doesn't completely ablate the pathway, but creates a noisy situation in which a variable phenotype occurs (the slide should say "variable phenotype"). There is quite a bit of data on this from other model organisms, and I wanted to point you to a fascinating paper from 2010 from Alexander van Oudenaarden's lab, where they showed empirically that this occurs when you lose certain genes within a pathway that leads to intestinal differentiation in C. elegans.
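To make the threshold picture in these models concrete, here is a minimal toy simulation. It is only a sketch: the function name, the Gaussian noise model, and every parameter value are illustrative assumptions, not anything taken from the slides or from the C. elegans paper. Each simulated embryo draws an upstream signal level, and the embryo is scored as affected when that level fails to cross the activation threshold.

```python
import random

def simulate_penetrance(mean_input, noise_sd, threshold, n_embryos=10_000, seed=0):
    """Return the fraction of simulated embryos whose upstream signal stays
    below the activation threshold (that is, the penetrance of the defect).

    Hypothetical toy model: the signal is one Gaussian draw per embryo.
    """
    rng = random.Random(seed)
    affected = 0
    for _ in range(n_embryos):
        upstream = rng.gauss(mean_input, noise_sd)
        if upstream < threshold:  # effector never self-amplifies
            affected += 1
    return affected / n_embryos

# All numbers below are made up for illustration.
wild_type = simulate_penetrance(mean_input=10.0, noise_sd=1.0, threshold=5.0)
null_no_backup = simulate_penetrance(mean_input=1.0, noise_sd=1.0, threshold=5.0)
partial_backup = simulate_penetrance(mean_input=5.5, noise_sd=2.0, threshold=5.0)

print(f"wild type:               {wild_type:.3f}")
print(f"null, no compensation:   {null_no_backup:.3f}")
print(f"null, partial/noisy:     {partial_backup:.3f}")
```

With the signal far above or far below the threshold, the outcome is essentially all-or-none. When the mean sits near the threshold and the noise is larger, genetically identical embryos split between affected and unaffected, which is one way to picture incomplete penetrance arising from stochastic variation alone.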
Rather than completely losing the pathway, you end up with a loss of expression, or variability in expression, of the upstream modulator, which then makes it almost random whether this gene rises above the threshold that allows for its self-amplification.

So I wanted to quickly go through some examples of variable expressivity and incomplete penetrance in our IMPC data, some case studies, and some areas we can continue to work on and explore. I'll start with a special case, and I think it's special because the phenotype is not particularly specific, which is our subviability. Then we can look at what we're seeing in morphological phenotypes in our embryo data. And finally, some work coming out of JAX, led by Vivek Kumar, on exploring variable expressivity in continuous traits.

For the IMPC folks this is old hat, but I wanted to explain it briefly. We assess the viability of all of our knockout mice when we're breeding our cohorts. What we see is that a significant percentage, roughly 25%, are completely embryonic lethal: we recover no homozygotes at wean. But what was surprising to many of us, I suppose, is that we also saw what we call subviability. That is fewer than the expected number of homozygotes; in our SOP, it's fewer than half the expected number of homozygotes recovered. And you can see this is a significant portion of the genome that we've knocked out, where we see variable lethality. So this is a plot of the viability of the subviable lines in the IMPC right now. Just to orient you, this is the odds ratio for any given line. Down here at zero are all of our lethal lines, and there are a lot of these. And then we have this set that we're calling the SOP subviable.
These are the lines we define as having fewer than half of the expected number of homozygotes. What's really interesting is that it's a continuous line. These are the 95% confidence intervals, and you can see there aren't any discrete break points, so we're getting the full range of variability. Hugh Morgan and Sarah Wells have determined that, because of the large number of animals we're breeding, you can statistically detect deviations from the expected genotype ratios beyond what we're calling our SOP threshold, which may be interesting as well. But I think the point here is that you get the full range of variability in lethality across all these mice.

And so we asked some simple questions. If you recall, paralogs are one of the potential explanations for variable expressivity. So we asked in our paper a couple of years ago: are the subviables any more or less likely to have a paralog? We know from previously published data, including Jackie White's paper and work prior to it, that a high percentage of lethal genes lack a paralog, and this makes sense. You only have one option; you knock that option out, and you're going to impede the viability of the organism. But what was somewhat surprising to us is that the subviable genes did not show an intermediate number of paralogs. They have fewer paralogs than the viables, or a similar number to the viables. That at least argues against the hypothesis that incomplete insertion of a paralog into the pathway would allow for noisy outputs, at least as the explanation for subviability. And this is recent data that Pilar and Damien put together, which says basically the same thing. In fact, the effect is a little more pronounced: of all the different classes of lethality or viability, the subviable lines have the fewest paralogs.
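The viability calls described above can be sketched in a few lines. This is a hedged illustration, not the IMPC SOP itself: it assumes a heterozygous intercross with a 25% Mendelian expectation for homozygotes (the talk does not state the cross), it works with the homozygote proportion and a Wilson interval rather than the odds ratio shown on the slide, and all function names are hypothetical. The last function illustrates the point about large cohorts: a line can sit above the half-of-expected cut-off and still deviate significantly from the expected ratio.

```python
import math

EXPECTED_HOM = 0.25  # assumption: Mendelian expectation from a het x het intercross

def classify_viability(n_hom, n_total, expected=EXPECTED_HOM):
    """Label a line from homozygote counts at wean, following the talk's wording:
    no homozygotes -> lethal; fewer than half the expected number -> subviable."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_hom == 0:
        return "lethal"
    if n_hom < 0.5 * expected * n_total:
        return "subviable"
    return "viable"

def wilson_interval(n_hom, n_total, z=1.96):
    """Approximate 95% Wilson score interval for the homozygote proportion."""
    p = n_hom / n_total
    denom = 1.0 + z * z / n_total
    centre = (p + z * z / (2 * n_total)) / denom
    half = z * math.sqrt(p * (1 - p) / n_total + z * z / (4 * n_total ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def lower_tail_p(n_hom, n_total, expected=EXPECTED_HOM):
    """Exact binomial P(X <= n_hom) under the expected homozygote fraction."""
    log_p, log_q = math.log(expected), math.log1p(-expected)
    total = 0.0
    for k in range(n_hom + 1):
        log_pmf = (math.lgamma(n_total + 1) - math.lgamma(k + 1)
                   - math.lgamma(n_total - k + 1)
                   + k * log_p + (n_total - k) * log_q)
        total += math.exp(log_pmf)
    return min(1.0, total)

# Made-up counts: (homozygotes at wean, total pups genotyped).
for n_hom, n_total in [(0, 40), (3, 60), (14, 60), (180, 1000)]:
    low, high = wilson_interval(n_hom, n_total)
    print(n_hom, n_total, classify_viability(n_hom, n_total),
          f"CI=({low:.3f}, {high:.3f})", f"p={lower_tail_p(n_hom, n_total):.2g}")
```

A real pipeline would also need a minimum number of genotyped pups before calling a line lethal; that rule is omitted here.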
What was interesting, and this is just some preliminary data on protein complexes, is that this doesn't necessarily hold true for protein complexes. So there may be points in the integration of the pathway where this would work and other points where it would not.

As I mentioned before, the other place where we see this is in our gross morphological data. Again, this is from our paper in 2016. This is ACVR2A, and people who know something about the BMP signaling pathway probably know there's an ACVR2B. So this is very consistent with the idea that paralogs can provide imperfect, incomplete signaling but rescue the phenotype from complete lethality and complete loss.

However, we have far more than just a few examples. Subviability is just one case, and the morphological features are just one case. What was remarkable to us is how pervasive this is across all of our morphological phenotypes. This is data from the DMDD, our colleagues in the Wellcome Trust at Sanger. Each one across here is a line, these are the phenotypes, and the color is the number of embryo specimens that display that phenotype. What they showed is that, across all of these individual lines, very few have the deep red that would indicate full penetrance. For the most part, you're seeing a remarkable degree of variable expressivity. This is our own version of the same heat map, for our E18.5 gross morphology. While you can see a few examples with highly penetrant outcomes, the rule is variably penetrant outcomes.

But these are all categorical data, and we know these are not necessarily ideal. We've talked about different methods we could use to go after this. Of course, we could use tools like K-means clustering, which is a really great way to describe variable penetrance.
But the only point I wanted to make here is that these buckets, the categorical aspect of this, are still quite limiting. I think I heard "lumper versus splitter" earlier today. There is no way you can properly capture all of the individual nuances of the phenotypes by putting them into categorical buckets. As all of you know, we have a solution to this, which is to quantitate our image data, our morphological data. The approach we've been using is microCT with automated registration to an atlas. The atlas allows us to quantitate the volumes of individual structures in the embryo, which could give us more information about variability, except that this pipeline is structured around averages. As all of you in the IMPC know well, we average the knockouts, we average the wild types, and we register them together, so the individual differences are lost in this approach. I've been speaking with Henrik Westerberg, whose group is starting to develop approaches that allow us to measure one versus many in the large catalog, in order to quantitate specimens separately. I think this will give us a much better tool to go after variable expressivity in morphological phenotypes.

The final thing I wanted to talk about is some of the work that Vivek and our stats team have been doing on variable expressivity in continuous traits. We know that heterogeneous variance, or heteroscedasticity, is detected in the statistical pipeline. In fact, from talking to Jeremy Mason, we have something on the order of 324,000 cases of this detected in the statistical pipeline, and we have a method to account for it. But while that accounting is useful for finding differences in means, which is the fundamental goal, it loses the variance as a potentially interesting trait that we could look at, discuss, and somehow interpret.
And so the question is, can we find these genes and treat the variance as a phenotype? This is just an illustration of the level of variance we see, across a long list of JAX alleles. You can see there's quite a bit of noise in the degree of variance. The other thing worth noting is that as the difference in the mean gets more significant, with a higher Z score, the degree of variance increases. That's something we often see.

So what is the approach? Vivek and the group came up with an approach that uses three different tests, the F-test, Bartlett's test, and Levene's test, to measure variance across our data set. The key thing here is that the control strategy is very important, because we found that the variance in the controls affected our results overall. They've also begun to do some clustering, to try to understand whether any groupings of variably expressive genes shake out and look interesting.

So, a summary of the results so far. One thing to note is that both the F-test and Bartlett's test are much less conservative than Levene's test. I think the reason is partly that these two tests are very sensitive to non-normal distributions of data, so it makes sense that you would see more hits from them. However, they went ahead and used a two-out-of-three heuristic to come up with a list of genes that we're calling, at least preliminarily, variably expressive genes. When you look at a set of these genes, it's really interesting to see that quite a range of variably expressive phenotypes can be found. MYH1 is our winner in all of this: you can see there are more than 20 different phenotypes that appear to be variable. And I'll note that the change can be in either direction, so this is agnostic as to directionality.
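The two-out-of-three idea can be sketched as follows. This is an assumption-laden illustration, not the group's actual pipeline: it handles only two groups (one mutant line against one control set), uses the median-centered form of Levene's test, picks a significance cut-off of 0.05, and computes p-values with a small hand-written incomplete beta function so that it runs on the Python standard library alone. All function names are hypothetical, and both groups need at least two values and nonzero variance.

```python
import math
from statistics import mean, median, variance

def _betacf(a, b, x, max_iter=300, eps=1e-13, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        even = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
        for coeff in (even, odd):
            d = 1.0 + coeff * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + coeff / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h

def _betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def _f_sf(f, d1, d2):
    """Upper-tail probability of an F(d1, d2) distribution."""
    return _betainc(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))

def f_test_p(x, y):
    """Two-sided variance-ratio F-test, so either direction of change counts."""
    upper = _f_sf(variance(x) / variance(y), len(x) - 1, len(y) - 1)
    return min(1.0, 2.0 * min(upper, 1.0 - upper))

def bartlett_p(x, y):
    """Bartlett's test for two groups (chi-square with 1 degree of freedom)."""
    dfs = [len(x) - 1, len(y) - 1]
    vs = [variance(x), variance(y)]
    df_tot = sum(dfs)
    pooled = sum(d * v for d, v in zip(dfs, vs)) / df_tot
    stat = df_tot * math.log(pooled) - sum(d * math.log(v) for d, v in zip(dfs, vs))
    stat /= 1.0 + (sum(1.0 / d for d in dfs) - 1.0 / df_tot) / 3.0
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

def levene_p(x, y):
    """Levene's test for two groups, using deviations from each group median."""
    z = []
    for g in (x, y):
        med = median(g)
        z.append([abs(v - med) for v in g])
    n_tot = len(x) + len(y)
    zbar = [mean(g) for g in z]
    grand = (sum(z[0]) + sum(z[1])) / n_tot
    between = sum(len(g) * (zb - grand) ** 2 for g, zb in zip(z, zbar))
    within = sum((v - zb) ** 2 for g, zb in zip(z, zbar) for v in g)
    return _f_sf((n_tot - 2) * between / within, 1, n_tot - 2)

def variably_expressive(mutant, control, alpha=0.05):
    """Flag a line when at least two of the three tests reject equal variance."""
    pvals = {"f": f_test_p(mutant, control),
             "bartlett": bartlett_p(mutant, control),
             "levene": levene_p(mutant, control)}
    return sum(p < alpha for p in pvals.values()) >= 2, pvals
```

Because all three tests are used two-sided or are inherently non-directional, a mutant line with less variance than the controls is flagged in the same way as one with more.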
So in some cases, at least for this gene, you see a reduction in variability versus the control population, which may be interesting in its own right as well. We also see that different phenotypes have different rates of variable expressivity overall. This can say something about the nature of the phenotype, but it can also indicate areas of QC that we could look at a little more carefully. As I mentioned, the clustering allows us to start to understand how genes that are variable fit into certain phenotypes and how they work together in an overall network. This is just some of the initial work we've been doing to piece together where these fit into robust networks, and to begin to pull some interesting biology out of what I think we initially thought was an annoying observation.

I have to mention a couple of caveats before we get too far down the line in thinking that variable expressivity will be a really interesting thing for us to go after. The first is that we have to understand whether we are making null alleles, and how often we are not. This is a high-throughput program. There's been some data in the past, certainly in our data set and in the IMPC overall, showing that the tm1 alleles are not universally null. And of course our CRISPR editing is really restricted by our understanding of the gene models. We may not get this right every time. There may be splicing events that we do not appreciate. We have to be wary of that, and we don't want to overinterpret. There are also local effects of targeting. With our ES cell-based mutagenesis, we know this is the case: David West's paper a few years ago showed that the cassette can influence the expression of nearby genes. And if those genes are part of the pathway, you can understand why there could be variability. We could also be disrupting non-coding but expressed features.
These are things like lincRNAs and microRNAs, et cetera. These are all potentially part of that, including with our VelociGene alleles, which are a small proportion of our pipeline overall but delete a large portion of the gene model, including everything that might lie in between. And of course I can't not mention the potential specter of off-target mutations. I wanted to make sure I say both ES cell-derived and CRISPR, because we know both contain off-target mutations.

So, just a few final thoughts. We observe what I would call pervasive variable expressivity in our mutants, despite the careful control of our experimental variables. I think this is a really interesting and unique opportunity. We need to develop additional methods to fully describe the phenomenon in all categories: categorical, image, and continuous phenotypes. The IMPC offers, I think, a unique data set to go after this: a large-scale, systematically generated set of data that is highly controlled. I think it can provide deeper insight into gene function, pathway organization, and how these genes fit into robust networks. And I think, and I could be wrong, that this can impact how we interpret variable expressivity and incomplete penetrance in our human disease data sets as well.

With that, I'm going to thank everybody, the IMPC overall, and the people I credited along the way. I know there are a couple missing from here, and I apologize to the two centers that are not on this list. I will update my slide. Thank you.