So it really is a pleasure to be here. I was here, I think, five years ago last time, and my message has stayed, hopefully, maybe not hopefully, very much consistent. As Jim alluded to, our lab has been focused on what some people have called the darkest matter of the human genome. We're specifically interested in regions that change very, very rapidly within the human species, areas of the genome that have proven to be dynamic both in terms of their structure and organization and in terms of their evolution. As he mentioned, one aspect of that study focuses on regions of the genome that are highly duplicated. These come in two different flavors: sequences duplicated within a chromosome, known as intrachromosomal duplications, and sequences duplicated between non-homologous chromosomes, known as interchromosomal duplications. The reasons we're interested in these are really twofold. One, these regions are dynamic by dint of the fact that their very high sequence identity predisposes them to non-allelic homologous recombination, promoting additional rearrangement events at these specific sites. The second reason is that, if you believe the work of Susumu Ohno and others, duplication is the primary force by which new genes and gene families evolve. So we're interested in these regions first from the perspective of dynamic mutation, de novo mutation associated with disease, and second from the perspective of the evolution, potentially, of new genes and gene families within human. Both of those topics I want to discuss today. So I'll just summarize the work that came from analyzing the whole genome, really the finished human genome. This is the pattern of the largest and most identical duplications within our genome, the blue lines representing these large blocks of intrachromosomal duplication, greater than 95% identity and greater than 20 kb in size.
And you'll notice from this that a lot of our duplications are essentially interspersed. So if you look at chromosome 7, you find a lot of the large pairwise ones separated by megabases of sequence. If you add the interchromosomal pattern, you get something like this. So this is the pattern of the human genome, and it's been relatively constant through the new assemblies. Most of this data, I should point out, came from BAC-based sequencing, that is, the sequencing of large-insert clones. If you go back and look at some of the first published whole-genome shotgun assembly versions of the human genome, these areas of the genome are completely missing. The important point is that about 60% of the large duplications within our genome are interspersed. As I say, they're separated by at least a megabase from their nearest neighbor, or they map to another chromosome. Contrast this with some recent work that we've done with Deanna Church, looking at the comparable finished version of the mouse genome. This is the pattern that you see for the most identical and the largest duplications within mouse. The total amount of duplicated sequence in the mouse genome now turns out to be very similar to what we saw initially with human, roughly 5% of the genome. But you'll notice two things. The actual locations are fewer in number: there are about half the number of sites in the mouse genome that are highly duplicated. And the second thing you'll notice is that most of the blue lines here, which indicate intrachromosomal duplications, are right on top of one another, suggesting that most of the duplications in mouse, about 82% of them, are tandem. That is to say, clustered in orientation.
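As an illustration of the interspersed-versus-tandem distinction described above, here is a minimal Python sketch. It is not the lab's pipeline: the `Alignment` record, the `classify` function and the category labels are hypothetical names, and it simplifies the "nearest neighbor" rule to a single pairwise alignment, using the one-megabase separation mentioned in the talk as the cutoff.

```python
from dataclasses import dataclass

MEGABASE = 1_000_000  # separation cutoff quoted in the talk


@dataclass(frozen=True)
class Alignment:
    """One pairwise duplication alignment (hypothetical record layout)."""
    chrom_a: str
    start_a: int
    end_a: int
    chrom_b: str
    start_b: int
    end_b: int


def classify(aln: Alignment, min_gap: int = MEGABASE) -> str:
    """Label a duplication pair as interchromosomal, interspersed, or tandem."""
    if aln.chrom_a != aln.chrom_b:
        # Copies on different chromosomes count as interspersed by definition.
        return "interchromosomal"
    # Order the two copies along the chromosome and measure the gap between them.
    first, second = sorted([(aln.start_a, aln.end_a), (aln.start_b, aln.end_b)])
    gap = second[0] - first[1]
    if gap >= min_gap:
        return "intrachromosomal_interspersed"
    return "tandem"
```

Under this simplification, a pair on the same chromosome with copies 10 kb apart is called tandem, while the same pair separated by several megabases is called interspersed.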
So this difference between man and mouse in terms of finished BAC-based sequence assembly has important ramifications, both in terms of evolution, the fact that you can juxtapose different pieces of DNA, creating complex configurations that you don't see in closely related species, and also in terms of disease. Its importance in terms of disease comes from some of the seminal work of Jim Lupski and others in the early 1990s. The idea is very straightforward. If you have duplicated sequences within a genome, you can trick the recombination machinery during meiosis to recombine where it shouldn't. So here I'm showing two of the four chromosomes aligning during meiosis, with the duplicated sequence shown in green, and a non-allelic homologous recombination event, also known as unequal crossing over, occurring, leading to gametes that have gained an additional copy of that duplicated sequence or have lost a copy of that duplicated sequence. The really important part is that if these are interspersed, imagine intrachromosomal duplications now with unique sequences encoding genes A, B, and C, then genes A, B, and C get taken along for the ride. So in addition to producing gametes that have additional copies of that duplicated sequence, we now have gametes that have additional copies of genes A, B, and C, and we have gametes that have lost copies of A, B, and C. If those genes are triplosensitive, haploinsufficient or imprinted, the result is disease. And so there are, at this point, about 30 different syndromes in the human population that are caused precisely by this mechanism. It is not really a genetic disease, because it doesn't have to be transmitted. It's something that goes on in all of us as we sit in this room and produce gametes, either egg or sperm. And an architecture that has a lot of these interspersed configurations is obviously going to be predisposed to these types of events at a high frequency.
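The unequal crossing over described above can be made concrete with a toy model. The sketch below is purely illustrative: it treats a chromosome as a list of segment labels, with a hypothetical `DUP` label for the duplicated sequence, and the function name `unequal_crossover` is an assumption, not anything from the talk. It shows how misaligned repeats yield one product carrying an extra copy of genes A, B and C and a reciprocal product that has lost them.

```python
def unequal_crossover(chromosome, repeat="DUP"):
    """Return the (duplication, deletion) products of a crossover between
    the first repeat copy on one homolog and the second copy on the other."""
    positions = [i for i, seg in enumerate(chromosome) if seg == repeat]
    if len(positions) < 2:
        raise ValueError("need at least two copies of the repeat")
    proximal, distal = positions[0], positions[1]
    # Product 1: one homolog through its distal repeat, then the other
    # homolog from just after its proximal repeat (gains the interval).
    duplication = chromosome[:distal + 1] + chromosome[proximal + 1:]
    # Product 2: the reciprocal product, which loses the interval and one repeat.
    deletion = chromosome[:proximal + 1] + chromosome[distal + 1:]
    return duplication, deletion


normal = ["left", "DUP", "A", "B", "C", "DUP", "right"]
gained, lost = unequal_crossover(normal)
```

In this toy case `gained` carries two copies each of A, B and C and three copies of the repeat, while `lost` retains a single repeat and none of the three genes.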
So these are some of the diseases. I'm sure many of you have heard of some of them: velocardiofacial/DiGeorge, Williams syndrome, Prader-Willi, and so on. There are a few interesting aspects of these diseases if you look at them. Shown here is the actual size of the duplication that is mediating the rearrangement. The important point here is that most of the events are large. The duplicated sequences have to be greater than 10 kb, often greater than 100 kb, in size. The second point is that the degree of sequence identity is also very high. Typically most of the diseases are caused by duplications with greater than 95% identity, and the vast majority are greater than 98%. And the third component of these diseases, which you can kind of see here, is that the vast majority of diseases that have been described thus far involve some type of neurologic component, either peripheral or central nervous system, and cognitive deficit in these kids. So the hypothesis was very straightforward. If we had a beautiful duplication map of the genome, which was borne on the sweat of a lot of wonderful people working on this project over the last 10 years, could we use that as essentially a morbidity map to predict the sites of disease associated with these specific regions? And specifically, could we focus on children with mental retardation to find new, previously unknown diseases? So this is the duplication map I showed you, again. It was not just a quality control exercise for the Human Genome Project; we actually viewed it as a disease map. And so here's our roadmap. All the gold bars represent blocks of sequence where the architecture is such that you would expect a high frequency of de novo mutation, based on very large, very identical sequences at these positions. So there were roughly 130 such regions of the genome at that time.
Twenty-three at that time, which are the gold bars with letters behind them, were ones already associated with disease; the hypothesis was that at least some subset of the remaining regions would be associated with de novo disease in the human population. So the way we did this is kind of old technology now, but we began this work about two and a half years ago. We targeted all of our regions that had at least 50 kb of unique sequence, less than 5 megabases, that were flanked by duplications of greater than 95 percent identity and greater than 10 kb in size. We took BACs from these regions and built a specialized microarray that contained about 2,000 BACs from these roughly 130 regions of the human genome. We spotted them on a microarray, and we would simply test a given normal DNA sample labeled with one fluorochrome against a diseased individual labeled with another fluorochrome, and look for signal intensity differences based on hybridization to this chip as evidence of gain or loss of that specific region. In terms of study populations, we used a normal control group, which people have argued maybe isn't the best control group, but it was what we had available at that time. It included all the HapMap samples, as well as an additional diversity panel of roughly 45 individuals. We used these normal individuals to establish the normal pattern of variation within individuals without disease, or at least without disease associated with mental retardation. I'm not going to go over those details other than to say that we found a lot of copy number variation. So, harkening back to something that Claire mentioned, the human genome is riddled with differences in gains and losses of sequence in different individuals. We then focused on a collection of kids that essentially the clinical community, or at least the diagnostic community, had given up on.
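The region-selection criteria quoted above for the BAC array can be written down as a simple filter. This is a sketch under stated assumptions, not the group's actual code: the `CandidateRegion` record and the `is_candidate_hotspot` function are hypothetical, and the 50 kb and 5 Mb bounds are read here as limits on the unique sequence lying between the flanking duplications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateRegion:
    """A unique interval flanked by a pair of duplications (hypothetical layout)."""
    name: str
    unique_bp: int         # unique sequence between the flanking duplications
    flank_identity: float  # fractional sequence identity of the flanking pair
    flank_bp: int          # aligned length of the flanking duplications


def is_candidate_hotspot(region: CandidateRegion) -> bool:
    """Apply the thresholds quoted in the talk."""
    return (
        50_000 <= region.unique_bp < 5_000_000   # at least 50 kb, under 5 Mb
        and region.flank_identity > 0.95         # greater than 95% identity
        and region.flank_bp > 10_000             # greater than 10 kb in size
    )


regions = [
    CandidateRegion("example_1", 450_000, 0.99, 100_000),
    CandidateRegion("example_2", 20_000, 0.99, 100_000),   # too little unique sequence
    CandidateRegion("example_3", 450_000, 0.92, 100_000),  # flanks too divergent
]
selected = [r.name for r in regions if is_candidate_hotspot(r)]
```

The three example records are invented for illustration; only the first passes all three thresholds.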
There were roughly 500 children who had been tested for fragile X and had come back negative, who had been tested for subtelomeric rearrangements, and whose karyotype was normal, for testing using this platform. So, some of the results. After screening the normal collection and then following up with studies of, initially, the first 291 children from Oxford, we found regions of the genome that look like this. What you're looking at here is a log2 relative hybridization intensity plot for four different individuals. These are all children with mental retardation, and we're looking for things that deviate from a log2 ratio of zero, which would be no difference. You'll probably notice that there's a lot of noise over these regions. This is because about a third of our probes were selected where the copy number really isn't 2N but is actually more than that, so this creates some background. But clearly there's something different about these four kids. They have about five BACs that are showing evidence of a microdeletion in a region, something we never saw once in the normal control group. These were validated by FISH. I think probably the most interesting aspect is that we could now go back and do a higher-density customized oligonucleotide microarray, so instead of using five BACs in the region we now designed 11,000 oligos over that specific region to really see whether the breakpoints were identical. Shown here are those four children once again. This is the log2 relative hybridization intensity, with the depression shown here in log2 terms and significance indicated where you see the red signal. And what you can see here is a couple of things.
First off, if we compare the affected child with the parents, so this is one of the children compared to mom and dad, the child has a deletion of roughly 450 kb precisely at that site. You'll also notice here the segmental duplications; these are very large, highly identical duplications which share about 99% identity over 100 kb in size. So the duplications are roughly demarcating the boundaries, or the breakpoints, of the deletion, which you'll also notice when you look at the regions contained underneath. The important point here was that we had an essentially identical critical region in four children identified from the study of mental retardation. All of them had haploinsufficiency, at least that's our model, and in fact all of them that we've been able to test so far were de novo events. In other words, the parents did not have this lesion; it was seen specifically in the kids. On top here are some of the genes. There are five genes mapping into that region, and obviously there are some great candidates. One of the most interesting is MAPT, also known as tau. It's a gene in which point mutations have been associated with Parkinson's, Alzheimer's and frontotemporal dementia. So we're now screening patients who are essentially phenocopies in terms of disease and looking for point mutations. I'd like to just emphasize, or make this note, that even though we screened only 300 kids, this was roughly, of the idiopathic collection that we looked at, roughly the total in terms of disease. These are what the kids look like. In collaboration with our former competitor Bert de Vries in Holland, we've now had the opportunity to look at roughly 21 children, all of whom have the microdeletion. In 19 of them we've been able to look at parentals and show that these are de novo events, and if you look at these kids you can see there are some similarities in phenotype.
One of the most pronounced, believe it or not, is this very bulbous nose. You can see a pronounced philtrum, sometimes a protruding tongue, as well as a fairly happy disposition, which has actually been noted in many of the clinical records. The children have a better outlook than most of us in terms of life. And in fact, now that we've been able to go back and show clinicians the data, they've been able to identify additional kids using this approach. So one of the interesting parts of this, what we think is a new deletion syndrome, is that the exact same region that we identified as being deleted in the human population was described a year and a half earlier by Kári Stefánsson from deCODE as being the site of a common inversion polymorphism in humans. Shown here is the region once again. This is actually looking at the CEPH diversity panel, with the black indicating the frequency of that inversion. That inversion is largely restricted to Caucasian populations; both European and Mediterranean populations have this inversion most commonly. Once you get into Africa, Asia and Amerindian populations, you see very low frequencies of this inversion. Their data suggested that this inversion, for completely different reasons, was associated with increased fecundity and with increased recombination in these populations; that was based on genealogical data from the Icelandic population. So we went back and looked at our kids to see if they carried the inversion haplotype, and to date, 19 out of 19 cases all come from this inversion haplotype.
So I want to make it clear that we don't necessarily know whether it's the inversion that's predisposing to this microdeletion event or something else on that haplotypic background, but the data are overwhelming that this inversion polymorphism, which is ethnically stratified, or the inversion haplotype, is predisposing. One of the ramifications would be that this is largely a Caucasian-specific idiopathic mental retardation syndrome, and our screening so far of African Americans, a screen of 500 kids, has shown no cases of this particular deletion. That wasn't the only one we found. Here's another region, on 15q24.1 to 24.2, 4 megabases in size. These are the children, and these are their actual genotypes on oligo array CGH. Breakpoints in three of the four cases occur precisely at regions of high sequence identity. In these three cases we know that each of these events is de novo. These kids are fairly high functioning; they have IQs around 65 to 70. They have been described as having autistic spectrum disorder, but they have extra features such as growth deficiency. Here's yet a third example. This is distal to the Prader-Willi region, on 15q13.3. In our initial screening we skipped over this region, and that was because of our criteria. This index patient here had a deletion between breakpoint 3 and breakpoint 5 that actually was not a de novo event. When we looked at the parents, the mother actually had this very large deletion over this region; however, it turned out that the mother also had mild mental retardation as well as epilepsy. We got two additional cases that came in. Both of these cases were smaller, between breakpoint 4 and breakpoint 5. These particular cases were both de novo, and in both there's also mild mental retardation or developmental delay and epilepsy. We don't know for sure if this is a genomic disorder, but we're betting it is. What's particularly interesting is that there is one gene located here,
CHRNA7, which is a gated ion channel gene that has been implicated in, though I don't think ever proven to be associated with, myoclonic epilepsy. So we believe that haploinsufficiency of this region also causes disease, and once again the breakpoints are mapping to these very large, highly identical duplications. The last example that I'll show you is an example of a recurrent deletion not associated with mental retardation. We've now moved outside of kids with mental retardation and started screening kids with other types of pediatric disease, and this is an analysis of some of those children. This is a collection of roughly 80 pediatric patients with renal disease that have been screened. What we found in this particular case was once again a de novo deletion; I should point out that all of these cases are de novo, with breakpoints embedded right within the segmental duplications. What's particularly remarkable about this disease is that, at least in terms of the samples that we've looked at, and this is largely with Christine Bellanné-Chantelot, it accounts for about 20% of the pediatric patients with renal disease that they have in their collection. So it's what we think is a common microdeletion. Interestingly enough, it's never been observed once in a control group of 927 individuals, and on top of that, about 36% of children with maturity-onset diabetes of the young type 5 also have the same microdeletion. There is a gene in this region, TCF2, a transcription factor in which point mutations have been shown to be associated with both renal disease and MODY5 diabetes. So in summary, we've now looked at a large number of kids, particularly from the IMR study. These numbers are based largely on the initial set of 300 from Oxford, and in these patients we identified what we think are roughly 16 sites of novel structural variation. I wouldn't claim that the majority of those are causative, but I do feel comfortable
saying that we do have at least three novel genomic disorders in which we have de novo events, recurrence, and phenotypic similarities that actually allow us to assign this as a new disorder. We have one example of a microdeletion event associated with diabetes and renal disease, and I'd be willing to hazard a bet that if we screen more children with more forms of pediatric disease we'll find additional genomic disorders associated with a wide range of phenotypes. I'll just leave this one slide here as an example of why I think this is so important. We just finished screening, using the Illumina platform, with Debbie Nickerson and Greg Cooper in my group, a large number of normal individuals. These are individuals that came in essentially for lipid testing as part of a study known as the PARC study. Shown here are hotspot regions that were found deleted or duplicated within this normal control group, with the duplications in pink and the deletions in blue. These are the numbers of chromosomes, from this collection of roughly 1,920 chromosomes, that were shown at various frequencies. So here is the absolute number, here's the 1% frequency cutoff, and here are a bunch of events that are at roughly 0.1 and 0.2% frequency. So, coming back to the point that Richard made, there are two issues that I want you to think about. In this group, roughly 6% to 12% of normal individuals have big deletions precisely over regions that are predisposed to non-allelic homologous recombination. We have an excess of deletions versus duplications, which we don't understand, but we have an absence of things that are around the 1% to 2% to 3% frequency. I would bet that these are being fed by these mutations at a high frequency into the normal pool, and the question remains open: what is the impact of these in terms of disease or susceptibility? So one of the things you might ask yourself is, if you think about the mouse genome architecture and human,
why do we have all these large blocks of intrachromosomal and interchromosomal duplications if they predispose 10% of our genome to microdelete and microduplicate at a high frequency? We have tried to address this question over the years, and I'll go over this fairly quickly. One of the important things to realize about the duplication architecture in these regions is that it's not just one piece of sequence; it's essentially heterogeneous, made up of many different parts that have different evolutionary histories and trajectories. This is one of roughly 400 such regions in our human genome, and everything colored and gray is duplicated, so basically this full 790 kb stretch of DNA is entirely duplicated. When you actually reconstruct the evolutionary history of this region, what you find is that everything in color, we've been able to show, comes from a different area of the genome. So we have essentially this hodgepodge mosaic over these specific regions, made up of all of these pairwise alignments from all over the genome. To complicate matters, these regions then duplicate as blocks and can share large blocks of homology in common with one another, and these are the types of events, the secondary events, that actually predispose to the microdeletions and microduplications associated with disease. So we have this architecture; can we systematically reconstruct the evolutionary history of these regions? Working with Pavel Pevzner, we came up with an approach to look at all the individual pairwise alignments within the human genome that make up these duplication blocks and break them into evolutionarily shared segments. So we could break all these pairwise alignments into individual subunits and then, using data largely from work at UCSC, compare these regions of the human genome, all the duplicated positions, to see if we could identify the ancestral segment from where the duplication began and therefore provide directionality in terms of the duplications. And here the
logic is pretty straightforward and primate specific. If we look at outgroup species such as rat, dog and mouse, which should not have these regions duplicated, we should see a single hit. Moreover, because the human copy that's ancestral moves by this multiple-step procedure in terms of duplication, we should see more orthologous anchors between that human copy and a mammalian outgroup sequence. Using this approach we defined the ancestral origin for 67% of the duplications within the human genome. We validated by FISH to see if we really could identify these ancestral origins: we take an outgroup species, we take a probe that comes from the derivative locus, and we hybridize to see if it goes back to the spot that we predicted. That was confirmed in this case, in a relatively small number of experiments, not out of 12 times. We then also compared the experimental maps that we had generated over the years with our in silico prediction with Pavel, and you can see that there's pretty good correspondence between the duplicons that we identified. So what do we learn from this analysis?
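The outgroup logic described above can be sketched in a few lines of Python. This is a hedged illustration only, not the published method: the function `infer_ancestral_copy`, its inputs (a count of outgroup hits and a per-copy count of orthologous anchors) and the tie handling are all assumptions made for the example.

```python
def infer_ancestral_copy(anchor_counts, outgroup_hits):
    """Pick the likely ancestral human copy of a duplicon.

    anchor_counts: dict mapping each human copy to the number of orthologous
                   anchors it shares with the outgroup (hypothetical input)
    outgroup_hits: number of places the duplicon matches in the outgroup genome
    Returns the copy name, or None when the evidence is ambiguous.
    """
    # The duplicon should be single copy in the outgroup; otherwise no call.
    if outgroup_hits != 1 or not anchor_counts:
        return None
    ranked = sorted(anchor_counts.items(), key=lambda item: item[1], reverse=True)
    # A tie for the most anchors gives no directionality.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


call = infer_ancestral_copy({"copy_a": 12, "copy_b": 3, "copy_c": 1}, outgroup_hits=1)
```

With the made-up counts shown, `copy_a` is returned as the putative ancestral locus; a duplicon with more than one outgroup hit, or a tie in anchor counts, is left unassigned.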
So here's the part that we learned. If we start looking at these intrachromosomal duplication blocks that cause disease, what we find in almost all cases, with one or two exceptions, is the following. Shown here is a map of the duplication blocks; these are all the duplication blocks that have emerged in the last 25 million years on chromosome 15, and about a third of these cause disease. One of the things that we find is that located almost precisely in the center of these blocks is a common sequence, in about 90% of the cases, at least for this specific chromosome. This is what we call a core duplicon. It has a number of interesting properties. It's the most abundant and most ancient, as you might expect, in terms of duplications. Even though these blocks all have heterogeneous histories, it is common to the vast majority and seems to be the focal point for intrachromosomal duplication formation. Cores are frequently duplicated as solo elements in the genome, but the flanking duplicons rarely are; the flanking duplicons almost always exist in association with a core. And when you look at the cores, they are enriched four- to five-fold for genes, at least annotated genes per base pair, as well as ESTs. So these seem to be the most transcriptionally active, most dynamic areas of the genome. When we compare those cores, and we find them on about half a dozen human chromosomes that have experienced this burst of intrachromosomal duplication, what we find is that these cores are often associated with great ape and human specific gene families that have been described in the literature over the last five or six years. We described one of the first, called the nuclear pore interacting protein, which evolves about 50 times faster than most normal genes based on dN/dS ratios. There are a number of other genes that have been described. The common features of these genes are that they do not have orthologs in mouse, they have multiple copies in human and chimp, they show dramatic changes in their expression profile compared to
at least outgroup species such as baboon or macaque, and at least three of the four examples here show signatures of positive selection, in two cases very dramatic examples of positive selection. So I'll just finish off by sharing with you some of the work we've been able to do with Eric and NISC in this regard. Because these regions are so complicated, we really can't get a handle on their architecture from looking at whole-genome shotgun sequence assemblies of chimpanzee, gorilla, macaque and so on. So working with Eric, we've been able to target these regions and resequence them systematically in a number of primate species. Shown here is another core region. Just to give you an idea, these are all the locations of these cores, or I should say these duplication blocks. This block is about 250 kb in size, and there's this core of roughly 20 kb which is in 14 of the 16 blocks that are shown on this chromosome. This is a core which is particularly interesting, as it has a very rapidly evolving gene family embedded within it: this is the nuclear pore interacting protein. If you look at the degree of sequence identity in sliding windows across this region of the genome, comparing any two of these copies, you will find troughs and peaks in the sequence identity, and what's most remarkable is that these troughs correspond precisely to the positions of exons. This is an 8-exon gene with no known function, and the other thing I'll just point out is that 98% of these changes result in amino acid changes between the copies. This is an extreme example of positive selection. So working with Eric, we drilled down and looked at a lot of other copies in other primates, particularly focusing on gorilla, chimpanzee, orangutan, and baboon. We sequenced and annotated all of the sequences that we got back, both experimentally and computationally, and then we constructed a phylogeny of these segments. So I
hope you don't go blind, but this is the actual phylogeny shown here. This is based on a neighbor-joining analysis of 2 kb of non-coding sequence from the core, and shown here is the structure that you see, with HSA representing human, PTR representing chimp, GGO representing gorilla, and so on. What we get from this bewildering complexity over this part of the genome are really a couple of things. Number one, all of this architecture that we now see, which we now know causes disease, is about 10 million years young. All the events have occurred in the common ancestor of chimp, human, and gorilla or immediately after the separation of those species. The second thing, which I think is really interesting, is that when we look at orangutan we find none of this architecture: we see the core once again present, but we see completely different flanking duplicons, which are unique in human and all the other great ape species. So orangutan has done the exact same thing that our genome did 7 million years ago, using the same core, but it has actually picked up completely different flanking sequences, which are unique in chimpanzee, unique in gorilla and unique in human. So this tells us that this core is actively transducing, we think, segments of the genome around. And just to give you a perspective, going back now 25 million years, these are all the pieces that in human look like this. Sorry, so this is the architecture that we see in human. Each one of these blocks of sequence is essentially unique in baboon; they're unique in macaque. So we think these all began as unique-copy sequences, with the core beginning to jump probably about 20 million years ago, picking up flanks and continuing to grow, such that LCR16a and its associated duplicons now occupy about 10% of the euchromatin of human chromosome 16, in this case 16p. With large-insert sequences we also mapped the locations in orangutan, and so this is the orangutan picture. This is human 16, with very limited activity on human 16, but here you see on chromosome
13 the core has essentially jumped to a new chromosome and begun to do its dance again, creating a very complex architecture once again, with interspersed duplications distributed across, in this case, chromosome 13. So here I think the important point is that the cores are mobile: they can jump to new chromosomes, and they can transduce flanking sequences as part of their trajectory. I don't have time to go into all this data. To summarize what we know about this particular core: we know that it began as a single-copy sequence about 25 million years ago, and data that I don't have time to show you indicates that it was testis specific, expressed only in the testis, and it showed no evidence of selection in any of the normal tests looking at Ka/Ks. So this is a little bit heretical, I think, because most people would teach you that all genes are born from other genes. Our data suggests that this thing was born from a transcript that was probably neutral in terms of evolution. Then, about 25 million to 12 million years ago, in a common ancestor of orangutan, gorilla, chimpanzee and human, it began to move and began to duplicate, with copies on this lineage duplicating heavily on chromosome 16 and here duplicating on chromosome 13. At this point, when it began to duplicate, based on expression analysis in orangutan, chimpanzee and human, we see a ubiquitous pattern of expression. It's expressed in every tissue in orangutan and in human that we've analyzed to date; we've looked at about a dozen tissues in orangutan and about 32 different tissues in human. Between 7 and 12 million years ago, not on this lineage but in the African great ape and human lineage, we see extreme positive selection, with Ka/Ks values on the order of 10 compared to the Old World monkey sequence, suggesting that at that point some mutation must have occurred to lead to an open reading frame that was essentially under selection and then became fixed at a
very high frequency in the population. So the 98% amino acid changes that I mentioned are occurring right here at this branch. We believe that the movement of this core led to the emergence of a novel gene family about 7 million years ago, the youngest and most rapidly evolving genes in the human species. So in summary, I've talked about the architecture of the human genome with respect to these large blocks of duplication, talked about how complex they are, and specifically showed you some examples of how this complex architecture can predispose to large de novo deletions, probably with a huge effect, or selective effect, within the population. Our targeted approach has uncovered four new microdeletion syndromes that have been shown to be recurrent and de novo, and I think the question remains unanswered: what is the importance of this mechanism toward complex disease? Because if you think about it, none of these events can be tagged using a tag SNP, because they're de novo and in some cases they're occurring on different haplotypes. Then I talked about the evolutionary significance of these regions, particularly the core architecture that we think has emerged, and I'll leave you with this final thought: maybe the negative selection against these microdeletion and microduplication events that exist in our species may be partially offset by the positive effect of having newly minted genes, many copies of them, at new locations. And if you think about it, there's a huge challenge ahead, even though they are few in number, which is to work out the functions of these types of genes that don't exist in outgroup species, where STS and SNP mappers fear to tread, and we hope to continue. I think my students say it's going to be on my epitaph that he found these genes but never actually showed the function of a single one. Hopefully, maybe five years from now, I can come back and share with you some evidence that they're actually functional. So
I'd like to acknowledge these folks: Andy Sharp and Heather Mefford, the postdocs who did most of the work on the human disease angle; Matt Johnson and Zhaoshi Jiang, both students, who did all the work on these core regions; and good colleagues in sequencing centers, Baylor, WashU and specifically NISC, who rose to the challenge of sequencing some of these very difficult clones. I think I have a reputation, probably rightly deserved, that these are some of the nastiest clones for sequencing centers to sequence. I thank the patience of people like Baishali Maskeri, Bob Blakesley and Gerry Bouffard, who really took these on to completion, at least within the primates, and a lot of great clinical colleagues, particularly overseas, who have been very forthcoming in providing samples and working on collaborations. Thank you.

Any questions from the floor? I'll ask one. Richard's got one. Richard, you got one?

Evan, what about wild mice? Have you got any duplication data there, or have they got the same low level of these events?

Nothing on wild mice, so no information on wild yet. We have a lot of information from the inbreds over the duplicated regions, and they show as much variation as humans do. The only difference is that the variation is restricted to the duplications, which are tandem, and doesn't influence these unique stretches between the interspersed duplications.

I want to ask a question. We're trying to get a PowerPoint loaded, so I'll also stall a little, Evan. So in the screen that you did of the pediatric patients, with pediatric disease you'd likely find more of these copy number changes, but is there no reason to think, if you screened unusual adult-onset diseases, you might find similar?

We've toyed with the idea, and we're thinking about doing a more adult-oriented disease. I guess I kind of have this fundamental belief that if we can show it at a pediatric level there'll be a stronger genetic component, so I'm more interested in screening more kids with disease in which we don't have a good
explanation, rather than diseases where environment will play a bigger role and genetics might play less.