So I'll just start off by thanking the organizers for giving me the chance to participate in this workshop. I think my talk is going to be a little bit of a U-turn from the past two talks, detouring away from data integration and revisiting some of the things that were brought up in this morning's session. So I'm going to talk about de novo mutation, and in particular how the deluge of genomic data that's now available in the public sphere is really transforming our understanding of how new mutations arise across genomes, some of the challenges that we face when trying to identify new mutations, and then I'll wrap up by talking about what at least I perceive to be some of the major outstanding challenges in this area. So mutation obviously really lies at the heart of comparative and evolutionary genomics. It's the ultimate source of all new variants, ranging in scale from single nucleotide variants to insertions or deletions, to larger-scale structural variants such as copy number differences and inversions, and all the way up to chromosome-level rearrangements like chromosomal fusions and whole-chromosome aneuploidies. And it looks like the Mac-to-PC conversion is a mutagen itself here. So it's obviously been appreciated for a long time that new mutations play very important roles in human disease, and over the past decade whole-genome and whole-exome sequencing studies have successfully identified a number of causal mutations contributing to a whole host of Mendelian and rare human diseases. And similarly, studies of large population and disease cohorts are starting to uncover the role that de novo mutations play in the genetic basis of more common diseases, diseases like epilepsy, schizophrenia and autism. And of course knowledge of how new mutations accumulate, and of the rate of mutation, is really of critical importance for evolutionary biology.
So the mutation rate, μ, is a key determinant of the level of genetic diversity within a population, and it also helps to determine the level of divergence between two species. In order to build phylogenies, right, we all invoke these models of nucleotide substitution that reflect how new mutations interact with natural selection and genetic drift over time. And knowledge of the mutation rate, of course, also weighs in on our understanding of how organisms adapt to their environment, notably the relative importance of adaptation from standing genetic variation versus new mutation. And a point that is maybe a little bit peripheral to the focus of this workshop, but I think it's still very, very important, is that mutation also leads to genetic drift in our inbred animal models, which obviously has a really profound impact on the reproducibility of research. So in the past, classically, to estimate mutation rates, we've relied on approaches like surveying the incidence of spontaneous phenotypic mutants within populations, or comparing sequences between two species that diverged at some known point in the past, or making inferences from levels of population diversity at a limited number of loci. But now, with the democratization of whole-genome sequencing, we have this ability to just directly measure mutation rates by comparing parent and offspring genomes. So this is basically comparative genomics on the time scale of a single generation. So new mutations are very simply detected as variants that are present in the offspring but not seen in either parent. And so using this strategy, the mutation rate can be calculated as the number of new mutations that we see divided by the number of sites in the search space, the genome size, right? And this sounds so simple to do, right?
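A minimal sketch of the trio comparison just described, in Python. It is an illustration rather than anyone's actual pipeline: the function names, the dictionary-based genotype layout, and the toy genotypes are all hypothetical. It also assumes a diploid child, so the count is divided by twice the number of callable sites to give a rate per base pair per generation; that factor of two is an assumption added here and was not spelled out in the talk.

```python
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]  # unphased diploid genotype, e.g. ("A", "G")
Site = Tuple[str, int]      # (chromosome, position)


def find_de_novo_candidates(
    child: Dict[Site, Genotype],
    mother: Dict[Site, Genotype],
    father: Dict[Site, Genotype],
) -> List[Tuple[Site, str]]:
    """Return (site, allele) pairs where the child carries an allele seen in neither parent."""
    candidates = []
    for site, child_gt in child.items():
        # Only sites genotyped in all three individuals are treated as callable.
        if site not in mother or site not in father:
            continue
        parental_alleles = set(mother[site]) | set(father[site])
        for allele in sorted(set(child_gt) - parental_alleles):
            candidates.append((site, allele))
    return candidates


def mutation_rate(n_de_novo: int, callable_sites: int) -> float:
    """Mutations per base pair per generation.

    Assumption: the child is diploid, so each callable site is counted
    twice (one maternal and one paternal copy).
    """
    if callable_sites <= 0:
        raise ValueError("callable_sites must be positive")
    return n_de_novo / (2 * callable_sites)


# Toy genotypes, invented purely for illustration.
child = {("chr1", 100): ("A", "G"), ("chr1", 200): ("C", "C"), ("chr1", 300): ("T", "A")}
mother = {("chr1", 100): ("A", "A"), ("chr1", 200): ("C", "C"), ("chr1", 300): ("T", "A")}
father = {("chr1", 100): ("A", "A"), ("chr1", 200): ("C", "T"), ("chr1", 300): ("T", "T")}

candidates = find_de_novo_candidates(child, mother, father)
print(candidates)
```

In this toy trio only the G allele at chr1:100 is absent from both parents, so it is the single candidate. Both the numerator (which candidates are real) and the denominator (how many sites are truly callable) are where the practical difficulty lies.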
But the reality is, in practice, this is really challenging, and that's because there are some key challenges associated with estimating both the numerator and the denominator of this simple equation. And these challenges really tie back to some of those core issues in short-read alignment that we were talking about this morning. They make it difficult, first, to identify new mutations at all, and second, to quantify the number of sites over which we can actually reliably detect de novo mutations. So for one, short reads can yield uncertain or low-quality alignments in regions of elevated divergence from the reference, and this is potentially a source of erroneous variant calls. And this is a problem that rears its head when we're trying to align reads in very rapidly evolving regions of the genome, or trying to align reads that are derived from one species to the reference genome of a closely related species. And I want to just quickly highlight one example that I think really goes to underscore the severity of this particular challenge. So some of my mouse genetics colleagues at the Jackson Lab and at the University of North Carolina recently identified de novo mutations across the genomes of an eight-way recombinant inbred mapping population of mice known as the Collaborative Cross. So each Collaborative Cross genome is a unique genetic mosaic. I'm trying to use the pointer here, but it's not showing up. It's showing up on the screen. Okay. There we go. Okay. It's a unique mosaic that carries genetic contributions from eight phenotypically and genetically diverse founder strains. So the haplotype contributions from each of those founder strains are designated by a unique color here on these cartoon depictions of their genomes.
So one of these eight founder strains is, in fact, the mouse reference strain, which is a representative of the Mus musculus domesticus subspecies. And two more of these founder strains are derived from very divergent subspecies of house mice, Mus musculus castaneus and Mus musculus musculus. And what my colleagues find when they look at the distribution of new mutations across these Collaborative Cross genomes in aggregate is that across regions of the genome that are inherited from these divergent strains, there are nearly two times as many apparent de novo mutations as are observed in regions of the genome that are inherited from the reference strain. So I think this really goes to underscore the extent to which alignments of these divergent sequences are really problematic. So structural variation is another particularly large problem for short-read alignment, and this was a point that was raised again and again in this morning's session. So reads that derive from duplicated genomic regions, denoted here by the red triangles, commonly can't be mapped back to a single unique position in the genome. And so what you have then is the situation where erroneous alignment of a read from one duplicate to a copy somewhere else in the genome can give rise to false positive calls. And similarly, you can have cryptic structural variation in your sequenced sample relative to your reference genome that can result in false positive calls. So for example, when the sequenced sample harbors a duplication that's not present in the reference, reads that derive from both copies in that sample are going to be collapsed back onto that single locus in your reference sequence, giving rise to a false positive SNP call.
So these challenges are made even more acute, I think, when we stop to consider the fact that interlocus gene conversion between homologous duplicated sequences can transfer variants that arise in one duplicate, or duplicon, onto the backbone of another, such that the genetic identity of some of these duplicated sequences is actually really quite fluid. And what's more, of course, ectopic recombination between misaligned duplicates can give rise to deletions and duplications, and these happen quite frequently in genomes. So these are very dynamic genomic regions that present a number of issues for short-read mapping, and the use of linked reads or long-read methods may help skirt around some of these issues and help resolve sequence variation in these complex regions. But one additional strategy that I think might hold some promise toward rectifying some of these issues is to move toward utilizing collections of genome sequences as opposed to a single reference genome. And so catalogs of known variants in a given species can be indexed for short-read alignment, collectively forming what's called a pangenome, and the use of these multiple genomes, including genomes that cover alternative structural haplotypes, can really result in substantial improvements both to SNP discovery in the first place and to SNP call accuracy. So together these challenges with short-read alignment lead to a large number of false positive calls, and they make it quite challenging to precisely estimate the number of sites in a genome over which we can reliably detect de novo mutations. And another key fact too is that the rate of sequencing and genotyping error actually probably exceeds the de novo mutation rate itself. And so this requires that we then impose a set of pretty stringent variant filters on our data in order to separate the true positives from this larger number of putative de novo mutations. And so there's no real clear best-practices pipeline for doing this.
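To make "stringent variant filters" concrete, here is one hedged sketch in Python of the kind of hard filter a study might apply to a candidate de novo call. Everything in it is an assumption for illustration: the `TrioCall` fields, the `passes_filters` name, and all threshold values are hypothetical and are not taken from any study mentioned in the talk.

```python
from dataclasses import dataclass


@dataclass
class TrioCall:
    """Read-level evidence for one candidate de novo site (hypothetical layout)."""
    child_depth: int
    child_alt_reads: int
    child_gq: int            # genotype quality of the child's call
    mother_depth: int
    mother_alt_reads: int
    father_depth: int
    father_alt_reads: int


def passes_filters(
    call: TrioCall,
    min_depth: int = 20,     # illustrative thresholds only
    max_depth: int = 80,
    min_gq: int = 30,
    min_ab: float = 0.3,
    max_ab: float = 0.7,
) -> bool:
    """Apply simple hard filters to a candidate de novo call."""
    depths = (call.child_depth, call.mother_depth, call.father_depth)
    # Low depth gives weak genotypes; excess depth hints at collapsed duplications.
    if any(d < min_depth or d > max_depth for d in depths):
        return False
    if call.child_depth == 0:
        return False  # guards the division below if min_depth is set to 0
    if call.child_gq < min_gq:
        return False
    # Any alternate-allele read in a parent suggests an inherited variant.
    if call.mother_alt_reads > 0 or call.father_alt_reads > 0:
        return False
    # A true heterozygous de novo call should show roughly balanced alleles.
    allele_balance = call.child_alt_reads / call.child_depth
    return min_ab <= allele_balance <= max_ab


good = TrioCall(40, 19, 99, 35, 0, 38, 0)
skewed = TrioCall(40, 4, 99, 35, 0, 38, 0)
inherited = TrioCall(40, 19, 99, 35, 2, 38, 0)
print(passes_filters(good), passes_filters(skewed), passes_filters(inherited))
```

Each threshold removes sites from the callable search space as well as calls from the candidate list, which is one reason that different filter choices can shift the estimated rate.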
And so the filters that end up being applied are admittedly ad hoc and variable from study to study. And so some of those differences in variant filtering pipelines, and these challenges that we cope with in alignment, can result in differences in the estimated mutation rate from study to study. And so here I've plotted estimates of the per-base, per-generation mutation rate for a number of recent human studies. And you can see across these studies estimates of the mutation rate range from 0.96 to 1.3 times 10 to the minus eighth mutations per base pair per generation. And so the extent to which this variation reflects bona fide differences in mutation rate between study populations, versus noise that creeps into the system owing to differences in study design and differences in data processing, is really not clear. But despite all of these challenges, the ease of whole-genome sequencing now has really sparked a lot of interest in understanding mutation. And that in turn has really allowed us to build on a number of prior observations about how frequently mutation occurs across the genome and about patterns of mutation accumulation. So for example, more than 70 years ago, Haldane actually presented some of the first evidence for a higher male mutation rate in humans. And more than a century ago, Weinberg presented some of the first evidence for a paternal age effect on human mutation. And we now know, of course, that these patterns are borne out genome-wide. So this is data from a recent whole-genome sequencing study of a large number of pedigrees in the Icelandic population. So each data point there represents the number of de novo mutations transmitted from moms and dads to their offspring. And so you can see there's a clear sex difference in mutation rate in humans, as well as this pronounced effect of paternal age, with older dads transmitting more new mutations than younger dads.
And beyond validating these known mutational properties, whole-genome sequencing in human pedigrees has led to the discovery of some novel biology about mutation as well. So I'm briefly going to highlight just three examples. So first is that some recent work has actually identified a significant maternal age effect on mutation as well, although, as you can see, the effect is less dramatic than what we see in males. Second is that the particular types of new mutations that arise within genomes vary as a function of parental age. So here the fraction of new mutations belonging to each of eight mutational classes is plotted as a function of the parental age at conception. And you can see just from the variable slopes of those lines there that the types of new mutations that arise in genomes are really dependent both on sex and on age. And a third key insight that has emerged from these whole-genome sequencing studies in human pedigrees is that these direct pedigree-based estimates of the mutation rate actually don't line up very well with our prior estimates of the mutation rate based on human-chimpanzee divergence. And so this difference may be due to a change in the mutation rate itself, but it could also be due to the evolution of life history traits that in turn influence mutation, so things like parental age at conception and age of onset of puberty. So these both represent flagrant violations of the molecular clock assumption that gets invoked when putting together these divergence-based estimates of mutation. So a second approach for studying mutation that has really been enabled by the availability of large population genomic data sets is to look at the frequency and properties of very low-frequency alleles. And so very rare alleles are quite young, and so as a result they've yet to be really shaped by natural selection.
So in this way they give us something of a naive window onto the processes by which new mutations arise. So some recent work from Kelley Harris has used rare, population-private variants to infer differences in the mutation spectrum. And interestingly, what Kelley found is that the relative frequencies of different mutational types, shown here as a heat map with these mutational types broken down by their 5-prime and 3-prime flanking nucleotides, are quite variable across human populations. So for example, C-to-T mutations in this particular trinucleotide context right here are much more enriched in Europeans than in Africans. And so this is an interesting finding, but the extent to which this variation in the mutation spectrum reflects differences in environmental exposures to mutagens between populations, or systematic differences in life history traits that in turn influence mutation, or the possibility that there are genetic modifiers segregating between populations that influence the mutation spectrum, really is difficult to tease apart in humans, where we don't have the ability to rigorously control for the effects of environment. This is, however, something that we can do quite easily in laboratory systems. And I'm going to gloss over the details for the sake of time here, but some recent work in my own lab has taken advantage of the unique origins of the laboratory mouse strains to actually infer the de novo mutation spectrum across a set of commonly used inbred mouse strains. And what we find is that these genetically distinct strains, all of which are reared in a common laboratory environment, differ in their mutation spectra. And you can visualize this here as just differences in the heights of these bars on this cumulative bar plot. So these strains all have very similar life history traits. As I said, they're all reared in a common environment.
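As a rough illustration of what tallying a mutation spectrum by flanking context involves, the Python sketch below labels each single nucleotide mutation by its trinucleotide context and reports the fraction falling in each class. It assumes one common convention, folding reverse-complement pairs so that the mutated base is always written as a pyrimidine; this is not necessarily the exact scheme used in the studies described, and the function names and toy mutation list are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def mutation_class(ref_triplet: str, alt: str) -> str:
    """Label a mutation such as 'TCC>TTC', folded so the central base is C or T."""
    ref_triplet, alt = ref_triplet.upper(), alt.upper()
    if len(ref_triplet) != 3 or not set(ref_triplet) <= set(COMPLEMENT):
        raise ValueError("ref_triplet must be three bases from A, C, G, T")
    if alt not in COMPLEMENT or alt == ref_triplet[1]:
        raise ValueError("alt must be a base that differs from the central reference base")
    if ref_triplet[1] in "AG":
        # Purine at the mutated position: describe the event on the opposite strand.
        ref_triplet = reverse_complement(ref_triplet)
        alt = COMPLEMENT[alt]
    return f"{ref_triplet}>{ref_triplet[0]}{alt}{ref_triplet[2]}"


def mutation_spectrum(mutations: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of mutations in each trinucleotide class."""
    counts = Counter(mutation_class(triplet, alt) for triplet, alt in mutations)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Toy (reference triplet, alternate base) pairs, invented for illustration.
toy_mutations = [("TCC", "T"), ("GGA", "A"), ("ACG", "T"), ("CAT", "G")]
spectrum = mutation_spectrum(toy_mutations)
print(spectrum)
```

Here `("GGA", "A")` folds onto the same class as `("TCC", "T")` because the two describe the same event on opposite strands. Comparing such fraction tables between populations or strains is the kind of contrast the heat map and bar plot summarize.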
So this really helps us to isolate a genetic effect, or a likely genetic effect, on the particular spectrum of new mutations that accumulate across these diverse strain genomes. And one thing that I think is interesting is that if there are genetic modifiers of the mutation spectrum that are segregating among strains, it seems to imply that there may well also be modifiers of the overall rate of mutation that are segregating among inbred mouse strains. And so currently, my lab is piggybacking on some ongoing breeding efforts at my home institution, the Jackson Laboratory, where we maintain thousands of inbred strains for commercial distribution. So the propagation of inbred strains by brother-sister mating really approximates in many ways the design of a mutation accumulation experiment. So starting from a single inbred strain pair, which we've dubbed Adam and Eve here, we're collecting samples from one male and one female for two parallel breeding lineages for each of two commonly used inbred mouse strains. And what we hope to do ultimately is to use whole-genome sequencing of some strategically selected subset of these animals, in conjunction with high-quality de novo reference sequences for these two inbred strains, in order to estimate overall mutation rates and derive the mutation spectrum for these two strains. And then we'll end up comparing the two of them in order to establish an effect of strain genetic background on mutation rate and mutation spectrum. So the picture that has emerged in recent years is, I think, that mutation is a much more complex and dynamic trait than many of us previously envisioned, which opens up some outstanding challenges and opportunities. And for one, the issue of identifying de novo mutations in the first place is, I think, a really important computational and bioinformatic challenge.
And to this end, the use of pangenomes or multiple reference sequences can help mitigate false positive variant detection in some of these more structurally complex regions of the genome. And most work to date has also focused on single nucleotide mutations to the exclusion of more complex forms of variation, but long-read sequencing methods, particularly when combined with short-read technologies that have more accurate base calls, could I think allow us to probe the full spectrum of new mutations that accumulate. We've made a lot of progress toward understanding human mutation, but a more modest number of investigations have really looked at the complexity of mutation in other taxa, which I think limits our ability to really understand how mutation evolves. And I've obviously focused on germline mutation rates here, but mutations that accumulate in the soma are obviously of really intense interest, particularly for their relevance to disease, and to cancer notably, and the extent to which germline and somatic mutations may be correlated or intertwined mechanistically is, I think, something that remains to be teased out. And finally, I think there are some fascinating opportunities for experiments in model systems that really aim to disentangle the importance of genetics, environment, and life history for observed mutation rate variation. And thank you.