Good morning. I'm Laura Elnitski, and I'm a PI at NHGRI. I'd like to welcome you to this session of Current Topics, entitled "The Regulatory and Epigenetic Landscapes of Mammalian Genomes." Now, due to rules, regulations, and government policies, I want to start by saying that I have no affiliations with commercial entities. I'm not here to make money, just to talk.

Nearly a dozen years ago, as this millennium began, President Clinton called the Human Genome Project the most wondrous map ever produced by humankind. And he was right. The study of the human genome is driving the entire future of medicine. As scientists directly involved in disease research, we are all likely to use genomic sequence, and the extensive data on sequence variants, on a daily basis. And yet, how likely are we to see a variant and know that it is the causative mutation we're looking for? To date, the hunt for disease alleles in the genome has focused on only 2% of the whole: the coding regions. Just 2%.

This morning, I want to talk about the other 98% of the genome. We typically ignore the half of this 98% that is represented by repetitive elements, but the remaining half likely holds the secrets to human disease. That's because this other half contains the regulatory components, which play a crucial role in delivering gene products in the proper spatiotemporal patterns, meaning through developmental time and within the necessary cells. Yet the sequences containing the regulatory elements are so hard to decipher that they have been called the dark matter of the genome. But that opinion is changing. Many mutations associated with diseases are being identified, and now we need to think forward about how mutation discovery leads to a mechanistic understanding of disease. This will ultimately enable new therapies.

So I really want to emphasize this point. The cell represents an interconnected set of events. Disease is not about a single mutation in a cell; it's about a cascade of events. These happen as a consequence of a mutation, and no event occurs in isolation. So therapies need to target the processes that are derailed by mutations. And when the mechanisms of disease are known, optimal therapies will become obvious. We're not there yet, but it's an exciting prospect.

For now, there's no map connecting the cascade of processes in disease. In other words, we've got a map of the genes, like the celestial stars, but we don't yet know how these genes are connected, like the patterns in constellations. However, these patterns are emerging. And they're partly revealed by a new code, one that coexists with the raw sequence of the genome, known as the epigenetic code. That code regulates processes in the genome and is passed from cell to cell, but it isn't encoded in the DNA. Instead, it is carried along either as methylation marks placed directly on the DNA, or as methylation and acetylation modifications on the tails of the histones that the DNA wraps around. And until we had the genomic sequence, we were absolutely blind to this code. The epigenetic patterns provide a landscape to the genome by identifying a cell type, distinguishing functional elements, and indicating gene expression levels. And ultimately, these patterns may be critical to diagnosing disease. Now, this morning, I'm going to describe some individual parts of the genome, but I'm going to emphasize how the genome is larger than the sum of its individual parts.
In other words, just as an alphabet forms words and words form language, the processes in the genome depend upon and react to each other. So the sequencing of the human genome was just the beginning of the story. We have the sequence of 2.9 billion haploid base pairs, but determining what they do and how they contribute to disease is what makes our job so fascinating. Amazingly, we know the genetics behind some 3,000 Mendelian diseases, and many scientists predict that about 4,000 more will be discovered in the next few years. In contrast, the genetics of complex diseases currently evades description. In the prevailing explanation, known as the common disease, common variant hypothesis, many genes are needed to contribute very small individual effects, each slightly elevating the risk of disease.

So I ask you, have we really learned nothing but probabilities from the genome, as some of the naysayers claim? I think not. New information about the role of DNA in health and disease is discovered every day. The genome sequence has expanded the potential of genetic testing, and it's opened the door for developing new therapies that target specific and diverse causes of the same disease, such as some types of cancer, enabling treatments that can be tailored to each specific patient. If you contrast that with the blanket, non-specific approach of chemotherapy, the potential payout of the Human Genome Project becomes palpable. So let's look at four important outcomes of the Human Genome Project: comparative genomics, the mapping of functional elements, the interpretation of disease processes, and reading the epigenetic code.

Now, in technological terms, the human genome may not be very impressive. We know that it's three gigabases in size, but as an information storage system, how do you think that equates to other information storage systems, such as Mozilla browsers? Any guesses? The answer? Those three gigabases, when converted to bits and bytes, become the equivalent of 750 megabytes in binary code. In other words, the human genome equates to 2.8 Mozilla browsers, and it's smaller than Microsoft Office, but it's probably almost as buggy.

Now, we like to think that the human species is the most advanced species on the planet, and this suggests that genomic complexity has been critical to our success. Yet the origin of that complexity is not obvious from studying the human genome itself. Genomically speaking, we fall somewhere between a chicken and a grape, and if you thought that the number of bases in the genome was significant, well, I'm sorry to tell you that there's not much difference between the person sitting next to you and a turkey or a lungfish.

Underlying the genome's complexity is its ability to create diversity, and this diversity comes in a number of forms. For example, multiple non-coding elements are used in combination to define precise, yet adaptable, patterns of gene expression. The alternative processing of genes and the use of alternative promoters also create extensive diversity in the genome. Furthermore, non-coding RNA, which went largely undetected until a few years ago, now plays a central role in the theory of multicellular complexity, carrying out a massive amount of regulation within the cell. So with this complexity in mind, maybe it's no surprise that the coding regions cover less than 2% of the genome.
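(To make the storage comparison from a moment ago concrete, here is the arithmetic as a minimal Python sketch; the genome size and the two-bits-per-base encoding are as quoted in the talk, and everything else is illustrative.)

```python
# Back-of-the-envelope: the genome as an information storage system.
# Each base is one of {A, C, G, T}, so 2 bits suffice per base.
GENOME_BASES = 3_000_000_000      # ~3 gigabases
bits = GENOME_BASES * 2           # 2 bits encode 4 possible letters
megabytes = bits / 8 / 1_000_000  # 8 bits per byte

print(f"{megabytes:.0f} MB")      # -> 750 MB, the figure quoted above
```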
And the hunt is on to find altered sequences that impact normal regulatory functions, to understand how they affect regulation, function, and speciation. And regardless of whether these mutations cause single-gene diseases or are involved in complex multigenic diseases, they have subtle and far-reaching impacts that create diverse and unpredictable outcomes.

So what do we know about the human genome so far? Some 1,400 disease-associated variants in the genome have been identified through genetic epidemiological studies known as genome-wide association studies, or GWAS. This is a statistical technique that compares thousands of patient samples to thousands of control samples, and it has identified sequence variants associated with the risk of Crohn's disease, heart disease, diabetes, and dozens of other conditions. The vast majority of these 1,400 GWAS variants lie in the non-coding regions of the genome, and they're providing us with enormous opportunities to discover what they do. So throughout this talk, we'll take a look at functional categories of the genome to further explain the steps you might consider to ascertain function at these GWAS sites.

Now, probably the biggest impact of the Human Genome Project to date is the ability to compare genomes at the nucleotide level. And that's because mutations in functional DNA are less likely to be tolerated, so functional DNA leaves a signature of similarity across genomes. Early on, with only the human and mouse genomes available, some of the most important insights were gained. For example, while some genes change rapidly, such as those involved in immunity, reproduction, and olfaction, other loci containing developmental genes are so highly constrained that not a base pair has changed in the 75 million years of evolutionary time since the last common ancestor. It's pretty amazing. So how much overlap is there between the human and mouse genomes? Any guesses? At the nucleotide level, approximately 40% of the human genome can be aligned to the mouse genome. These sequences seem to represent most of the orthologous sequences that remain in both lineages from the last common ancestor, with the rest likely to have been deleted in one or both genomes. And today, you'll find the genomes of 36 mammals and 10 non-mammalian vertebrates aligned in the genome browsers.

You'll recognize that we use genomic diversity to uncover the story of human evolution, because it refines the historical account of the early migration of humans out of Africa into the Middle East and beyond. And you've probably read that having the sequence of the human genome available was crucial to the discovery of unmistakable traces of Neanderthal DNA mixed into some modern human populations. So what percent of genomic DNA in people from Eurasia and the Southwestern Pacific was inherited from Neanderthals? Between 1% and 4%. But did you know that this inherited material is not concentrated in genes? Instead, it's spread somewhat randomly through the non-coding portions of modern human genomes. And doesn't it make you wonder what it contributes? The Neanderthal data is available for everyone to view on the genome browsers as well. But let's not forget that getting to this point took integrated, cross-disciplinary approaches in biology, math, statistics, and computer science. And obviously, it wasn't so easy that a caveman could do it.
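(As an aside, the case/control comparison at the heart of a GWAS can be sketched in a few lines. This is a minimal illustration with invented allele counts, not a real analysis pipeline.)

```python
# Minimal sketch of the GWAS idea: compare allele counts at one SNP
# between patients and controls. The counts below are invented.
from scipy.stats import chi2_contingency

table = [[1200, 2800],   # cases:    risk allele, other allele
         [1000, 3000]]   # controls: risk allele, other allele

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.1e}")
# Because roughly a million SNPs are tested at once, genome-wide
# significance is conventionally declared only below p = 5e-8.
```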
So what is important in what we've learned? Taking a bottom-up approach, let's discuss some sequences in the human genome before talking about how these features fit together. There are millions of sequences in the human genome that perform essential functions. Broadly, they can be defined as protein-coding genes, non-coding RNAs, and regulatory sequences. One way to narrow the search for meaningful pieces of the non-coding genome is to align genomic sequences from species that are very different. For example, 46 species ranging from human to fish span an evolutionary time of over 450 million years. This computational exercise allows us to identify patches in the sequences where functional words stand out against the background, somewhat like words embedded in a string of random letters. This similarity is caused by negative selection, or purifying selection, which is the pressure to remove deleterious changes from a sequence. It is a strong signal of regions whose sequence has been maintained because it is critical for a biological function.

Now, in the early days of genome sequencing, the initial calculations of the amount of the human genome under purifying selection were performed when only the human and mouse genomes were available. The underlying prediction was that some proportion of the genome would be more conserved than expected when compared to the level of conservation found in neutral sequences. In other words, some proportion of the 40% of the genome that is alignable stands out as being preserved for functional purposes. So the 40% is represented by the blue curve, and when the alignable neutral regions were removed, represented by the red curve, the statistical estimate was that 5% of the genome was under functional constraint. That 5% represented the biological functions common to the human and mouse genomes. Now, keep in mind that this is two and a half times larger than the 2% estimate of the amount of coding sequence in the genome. So these perceived regulatory regions outnumber the coding regions. Although these statistical techniques could estimate the number of bases under constraint, the approach could not say which bases were functional, because conservation alone is not proof of function. And the statistics certainly didn't say what functions the regions perform.

So, scaling up and comparing multiple genomes, the regions under negative selection begin to stand out. These are known as phylogenetic footprints, and they are often a proxy for transcription factor binding sites. The use of additional genomes adds stronger evidence for finding regions under purifying selection. It has also been shown that these functional words can be found using a collection of short evolutionary distances represented only by primate genomes, and this works as long as enough primate species are included. This approach is known as phylogenetic shadowing, and it has the benefit of finding functions that are shared only among primates.

So some very extreme examples of negative selection exist in the genome. For example, have you heard of ultra-conserved elements? They are longer than 200 base pairs, and they have 100% identity, with no insertions or deletions, among the human, mouse, and rat genomes. The funny thing is, they were originally thought to be traces of contamination by human DNA in the mouse or rat samples. It's funny in retrospect. Maybe it wasn't so funny at the time. It was later shown that they actually belong in these genomes.
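(The definition just given is concrete enough to code. Here is a toy scan for ultra-conserved elements, assuming the three sequences come pre-aligned with '-' marking gaps; real analyses work from whole-genome alignments.)

```python
# Toy ultra-conserved element scan: runs of >= 200 aligned bases that are
# identical in human, mouse, and rat, with no insertions or deletions.
def ultraconserved(human, mouse, rat, min_len=200):
    runs, start = [], None
    for i, (h, m, r) in enumerate(zip(human, mouse, rat)):
        if h == m == r and h != "-":              # identical, ungapped column
            if start is None:
                start = i                         # open a new run
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))           # close a qualifying run
            start = None
    if start is not None and len(human) - start >= min_len:
        runs.append((start, len(human)))          # run extends to the end
    return runs                                   # list of (start, end) pairs
```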
This extreme level of conservation correctly predicts their functional importance, and that importance can vary. They act as enhancers. They act as regulators of splicing, or as domains in protein-coding genes. Now, the ones that overlap protein-coding genes can be really interesting. They actually tend to serve a dual role: in one instance, they serve as coding regions for proteins, and in another instance, they represent enhancer elements that regulate the transcription of neighboring genes.

So today, the methods to detect constrained elements fall into two broad categories: generative, model-based approaches and bottom-up approaches. An example of a widely used generative approach is available on the UCSC genome browser, known as phastCons. Have you seen phastCons? This approach identifies the constrained sequences within a genome based on known patterns of sequence variation among species. In other words, it calculates it all at once. In contrast to the generative models, an example of a bottom-up approach for constrained element detection is GERP. This approach first estimates constraint at individual bases in the genome, and then clusters neighboring constrained sites together to calculate a total percentage. Recently, GERP was used to predict the percentage of the human genome under functional constraint, and it found an additional 2%, bringing the total to 7% of the human genome.

Now, it's likely that much more than 7% of the human genome is functional. These measurements look at constraint across diverse genomes, and therefore they're bound to miss some things. Some constrained elements are missed because they are specific to human, or maybe they're conserved only in primate sequences, or some may not show conservation in a way consistent with how we measure it. In fact, one headline of the 2007 ENCODE paper published in Nature was that roughly 50% of the known non-coding functional elements seemed to be unconstrained across all mammals. The proposed explanations were numerous. Perhaps we lacked the biological assays to detect their functions. Perhaps chromatin accessibility is more important than sequence composition. Perhaps the regions were, in fact, lineage specific, or were functionally conserved but non-orthologous elements. Or perhaps they did not confer a selective advantage or disadvantage to the organism, and were simply biochemically neutral features. And now you can understand why it's so hard to find regulatory elements in non-coding regions.

Now, on the opposite side of the spectrum, some genes are under selection to increase the number of nucleotide changes in a particular species. This is known as positive natural selection, the force that drives the increase in advantageous substitutions. A well-known example, although at first glance you might not consider it advantageous, is the sickle cell mutation in the beta-globin gene. In its heterozygous form, this mutation actually protects against malaria. In this case, the blood cells are weakened by the presence of the sickle allele, causing them to be rapidly cleared from the body as they become infected. And this connection explains why the mutation has been retained, and why the incidence of sickle cell corresponds to regions where malaria is endemic.
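(The talk doesn't name it, but the standard way to quantify this kind of selection on coding sequence is the dN/dS ratio: an excess of amino-acid-changing substitutions over silent ones is the classic signature of positive selection. Here is a deliberately naive counting sketch; real estimators also correct for the number of synonymous and nonsynonymous sites available, and the sequences and codon table below are toy examples.)

```python
# Naive dN/dS-style counting: classify codon differences between two
# aligned coding sequences as silent (dS) or amino-acid-changing (dN).
CODON = {"ATG": "M", "AAA": "K", "AAG": "K", "GAA": "E", "GAG": "E",
         "GTA": "V", "GTG": "V"}   # tiny lookup table for this toy example

def count_changes(seq1, seq2):
    dn = ds = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 != c2:
            if CODON[c1] == CODON[c2]:
                ds += 1            # synonymous (silent) change
            else:
                dn += 1            # nonsynonymous (amino acid) change
    return dn, ds

print(count_changes("ATGAAAGAAGTA", "ATGAAGGAGGTG"))   # -> (0, 3): all silent
```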
Now, another example is a mutation in a regulatory region near the gene for lactase, which allows lactose tolerance to persist into adulthood. And studies show that this particular variant was apparently selected in parts of Europe as the domestication of cattle was taking place. So I don't know about you, but I'm thankful, because I love Dairy Queen.

In 2006, a collection of conserved regions with accelerated divergence was identified and named Human Accelerated Regions, or HARs for short. They were conserved across species but had accumulated accelerated changes in the human lineage, relative even to the chimp sequence. Up to 1,000 of these regions have been described, and some HARs are thought to have contributed to the emergence of human neuroanatomy, language, and complex thought. One example contains a gene enhancer that targets expression to the opposable thumb and may be responsible for modifications in the ankle or foot corresponding to bipedalism.

Note, though, that hidden processes in the genome have complicated the discovery of accelerated regions. While accelerated evolution is a signature of positive selection, the presence of accelerated evolution is not in itself proof that positive selection caused it. For example, there's another process known as biased gene conversion, which changes regions that are rich in A's and T's into regions that are rich in G's and C's. And this phenomenon, rather than positive selection, actually explains several of the HAR regions. So now three forces must be considered to interpret the fastest-evolving regions of the genome: positive selection, biased gene conversion, and the relaxation of negative constraint, which lets a region that was evolving below the background rate drift back up to it. All of these would create an excessive number of changes.

So the question is: accelerated relative to the global rate of change? Accelerated changes are typically measured on a local basis, but when I say local, local tends to be fairly large in the genome, a megabase or so. You calculate the background level of changes, and then you look in a single species to see if there is an excessive number of changes. Sure, the rates will vary, and really the only regions that can be easily detected are those well above the background level.
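(That recipe, counting substitutions in a window and asking whether they exceed what the local, megabase-scale background predicts, can be sketched with a simple Poisson model. Real HAR scans use likelihood ratio tests on phylogenies; the numbers here are invented.)

```python
# Sketch of an acceleration test: does a window carry more human-specific
# substitutions than the local background rate predicts?
from scipy.stats import poisson

def acceleration_p(window_subs, window_len, bg_subs, bg_len):
    rate = bg_subs / bg_len               # background substitutions per base
    expected = rate * window_len          # expected count in the window
    return poisson.sf(window_subs - 1, expected)   # P(count >= observed)

# 5 substitutions in a 500 bp window vs. 900 across the surrounding megabase
p = acceleration_p(window_subs=5, window_len=500, bg_subs=900, bg_len=1_000_000)
print(f"p = {p:.1e}")   # ~1e-4 here: a candidate accelerated region
```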
So a second very important contribution of the genome project is the mapping of functional elements. Although comparative genomics got us very far in our understanding of the basic makeup of the genome, it says nothing about the activity of a locus in a given cell type. Promoter and enhancer activities differ by tissue, reflecting dynamic processes that act on the static sequence. And the need to look at dynamic features within cells has fueled the rise of chromatin immunoprecipitation coupled to next-generation sequencing, or ChIP-seq. Under this tidal wave of data, the dark matter is rapidly giving way to an encyclopedia of the DNA elements that lie within the non-coding regions of the genome. This information is narrowing the search for mutations that cause disease, and it is identifying more functional elements than previous methods that relied on sequence conservation.

Now, I know you're all familiar with the static landscape of a cell, where there are genes composed of exons separated by introns. Every gene has at least one promoter, which extends some length upstream and is typically divided into the core promoter, the proximal promoter, and even an extended promoter. These are the recruitment and docking sites for the transcription machinery. Moving further away, we find elements such as enhancers, silencers, and insulators, which modulate transcriptional activity. They perform their functions through direct contact with the promoter, most likely through three-dimensional looping. And if we look more closely at the gene body, introns sometimes carry enhancer or silencer elements, either for the same gene or for neighboring genes. And as I've mentioned, alternative promoters are often embedded within the body of the gene to provide specialized expression patterns or to deliver unique forms of the protein.

So let's focus on enhancers for a minute. At the sequence level, we know that enhancers are composed of clusters of transcription factor binding sites, each six to 20 base pairs long, and they bind both activators and repressors. Evolutionary sequence conservation is typically present. At the epigenetic level, these transcription factors recruit chromatin remodeling complexes that facilitate open chromatin and leave a signature of DNase I hypersensitivity, which results from the repositioning of nucleosomes to allow transcription factor access. The chromatin remodelers act by targeting specific histone residues and transferring acetyl or methyl groups to create the histone code. So in active enhancers, you'll find the marks of H3K4 mono- and di-methylation and H3K27 acetylation. You'll also find the protein p300 present. So identifying enhancers is becoming a solvable problem, and they are being found just about anywhere. They're adjacent to genes, they're embedded within coding exons of genes, they're in introns of unrelated genes, they're even megabases away from their targets, and they have been found in regions known as gene deserts.

So one important component of enhancer function is physical interaction with the promoter. And DNA looping takes us completely out of the realm of sequence analysis and into the 3D structure of the nucleus. You've perhaps heard of the experiments used to detect looping interactions; they include ChIA-PET, 3C, 4C, 5C, and now Hi-C. These techniques identify enhancer-promoter interactions. And when all of this is coupled together, the data from long-distance interactions, gene expression, and ChIP-seq assays tell us the who, what, when, and where of regulatory function.

So the third crucial contribution of the genome project is the ability to interpret disease processes. Given the important role of enhancers in gene expression, it's not surprising that mutations in enhancers cause genetic defects. In the case of the Sonic hedgehog gene, a gene involved in the formation of digits in mammals and fins in fish, gain-of-function mutations in a distal enhancer create extra digits, whereas loss-of-function mutations cause limbs to be truncated. Gain-of-function mutations can actually be pretty cool to study. Here's an example of a mouse line that can regenerate lost tissue. While it's currently not known precisely how this is done, the hope is that the regulatory information could eventually be used to regenerate lost limbs. And it just makes me think some amazing regulatory pathways must be involved here.

So speaking of a role in disease, we've all heard of the common disease, common variant relationship, in which multiple weakly acting variants combine to cause complex diseases. But did you know that this idea might apply directly to enhancers? An emerging hypothesis about enhancers is that their modular construction could allow the accumulation of mutations that each carry a modest effect on disease risk.
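(Because enhancers are built from clusters of short binding sites, even a toy motif scan conveys how those modules can be hunted computationally. This sketch matches one IUPAC consensus with a regular expression; real pipelines use position weight matrices, and the sequence here is made up.)

```python
# Toy transcription factor binding site scan using an IUPAC consensus.
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "W": "[AT]", "N": "[ACGT]"}

def find_sites(sequence, consensus):
    pattern = "".join(IUPAC[c] for c in consensus)
    return [m.start() for m in re.finditer(pattern, sequence)]

# Scan an invented enhancer fragment for an E-box-like motif, CANNTG
print(find_sites("GGCACGTGTTCAGCTGAA", "CANNTG"))   # -> [2, 10]
```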
That modular construction makes enhancers excellent candidates for a role in causing common diseases. And now the challenge is to determine which variants affect function and which do not. In that same vein, several GWAS variants involved in type 2 diabetes, colorectal cancer, breast and pancreatic cancers, and even coronary artery disease now appear to reside in enhancer elements.

So what about the rest of these GWAS variants, actually the majority of them? This is where it gets challenging. Because enhancers aren't validated across the genome, how do we know whether a variant disrupts a functional element or is neutral? Do sequence conservation and phylogenetic footprinting provide evidence? Some. Do histone modifications and DNase hypersensitivity indicate function? Yes. Do p300 binding and looping interactions show activity? Of course. But those are a lot of different experiments to have to do, so to find this evidence we look primarily to resources like ENCODE. Please keep in mind, though, that these epigenetic patterns are cell type specific. This is such an important point. Since ENCODE has only looked at a subset of features in a subset of all tissues, most information is still lacking for the specific cell types involved in diseases. If that cell type hasn't been studied, the evidence of function might not exist.

Now, one big caveat to keep in mind is that the GWAS variant might not be the causal variant in disease, even though the statistics say it is associated. And this is due to the relationships among variants in a region. In other words, any variant in the green region on the slide, the region of linkage disequilibrium, which defines the family of variants that travel together, could be the culprit. So some fraction of GWAS variants don't actually fall at the functional site, but they are related to one that does.
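(Linkage disequilibrium is easy to quantify. A minimal r-squared calculation from phased haplotype counts, with invented numbers, shows why association statistics alone can't separate a GWAS hit from its neighbors.)

```python
# r-squared between two SNPs from phased haplotype counts (invented data).
def r_squared(n_AB, n_Ab, n_aB, n_ab):
    n = n_AB + n_Ab + n_aB + n_ab
    pA = (n_AB + n_Ab) / n          # frequency of allele A at SNP 1
    pB = (n_AB + n_aB) / n          # frequency of allele B at SNP 2
    D = n_AB / n - pA * pB          # deviation from independent assortment
    return D**2 / (pA * (1 - pA) * pB * (1 - pB))

print(f"r2 = {r_squared(480, 20, 20, 480):.2f}")   # -> 0.85
# r2 near 1 means the two variants travel together, so association
# statistics cannot tell which one is causal.
```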
So I'll come back to enhancers in a few minutes, when we discuss specific chromatin modifications. For now, I want to discuss the idea that functional elements in the genome are interconnected through chromatin. One way to holistically assess a cell is to look at nuclear interactions. As we all learn in introductory biology classes, eukaryotic chromatin structure can be viewed as a series of superimposed organizational layers. At the root is the DNA sequence, which undergoes higher-order packaging, being folded into nucleosomes, the fundamental units of chromatin, with approximately 147 base pairs of DNA wrapped around an octamer of histone proteins containing two copies each of H2A, H2B, H3, and H4. Now, in the nuclei of mammals, the most obvious form of structural genome organization is the compartmentalization of chromatin into euchromatin and heterochromatin. Euchromatin is open chromatin; it is accessible to transcription factors and is represented by the beads-on-a-string model of DNA. Heterochromatin is closed chromatin, packaged into the condensed form of 30-nanometer fibers with the help of histone H1.

Now, one of the most important recent discoveries in the field of genome biology has been the demonstration that genomes are non-randomly organized in the nucleus. Each of these colors on the slide represents an individual chromosome, and the area it occupies is known as its chromosome territory. In addition to these chromosome territories, transcriptionally silent heterochromatin, which contains gene-poor chromosomes or regions with low overall transcriptional activity, is localized to the nuclear periphery. The gene-rich chromosomes and the highly transcribed euchromatic DNA are situated in the center of the nucleus. What makes this really interesting is that some of these localization patterns are cell type specific, not only at the level of chromosomes, but also at the level of genes. And in that case, a gene's repositioning reflects its activation or repression status. So the holy grail in this field is to elucidate what mechanisms determine where a gene or a chromosome localizes within the cell nucleus. Another goal is to easily detect when a mutation has interfered with the nuclear positioning of a gene, compared to the positioning of the normal allele.

So what remains missing? What we desperately need in descriptions of gene regulation is to integrate information about how nuclear organization contributes to regulatory interactions that occur between chromosomes. A well-known example is the activation of the interferon-beta gene on human chromosome 9, which requires physical interaction with regulatory elements on chromosomes 4 and 18. How often these interchromosomal events occur throughout the genome is not known, but it might be predicted using the new techniques offered by Hi-C, which detect all interactions in the genome.

Now, in the silenced region at the nuclear periphery, important proteins include the lamin proteins. They form an intranuclear scaffold known as the lamina around the edges of the nucleus. This lamina supports the nuclear architecture and helps to organize nuclear processes such as DNA and RNA synthesis. So how do we confirm that lamina interactions are important? Find mutations that cause disease? Look for sequence conservation in lamina-interacting domains? Delete lamina structures? I'd go with the first one: mutations that cause disease. Sequence conservation is not often present in lamina-interacting domains. However, there are tracks on the UCSC browser, based on ChIP-type assays, that are now identifying broad regions of lamina interaction.

So, for example, Hutchinson-Gilford progeria syndrome is a premature aging syndrome, with symptoms of early cessation of growth, baldness by the age of two, progressive degeneration of the skin, muscle, and bone, and often fatal atherosclerosis in childhood. In this disease, the architecture of the cell nucleus is abnormal. The mutations associated with this disease occur in the protein lamin A. Without lamina function, nuclear organization is lost, and so are transcription and the transcription-coupled repair that corrects DNA damage. And it's this reduced ability to mount a response to DNA damage that triggers cell death and senescence, thereby promoting accelerated aging.

Groucho Marx had a great line about aging: aging is not a particularly interesting subject; anyone can get old; all you have to do is live long enough. And in many ways, the processes of aging parallel processes seen in disease. For example, chromatin is altered during aging. There are phenotypic outcomes, such as general heterochromatinization and telomere shortening, which correspond to and promote a decrease in the repair of chromatin aberrations. Other events, like the loss of ribosomal DNA repeats through looping and excision, the loss of DNA methylation, and the redistribution of histone H1, all create genomic instability. Instability leads to cell death and aging under normal circumstances, and to cancer in cases of disease. So let's continue with this idea of interpreting disease processes.
Several disease examples come from splicing mutations that reveal hidden regulatory elements. And what I mean by hidden is that they are embedded within the coding sequences. A portion of these mutations occur directly adjacent to the splice junctions, where they change the splice boundary information. The rest occur in central positions of exons that contain specialized splicing regulatory elements known as exonic splicing enhancers, or ESEs, and exonic splicing silencers, or ESSs. A similar set falls outside of the coding regions; these regulatory elements are known as intronic splicing enhancers, or ISEs, and intronic splicing silencers, or ISSs. These sequences are the binding sites for the RNA-binding proteins that regulate splicing, and therefore their nucleotide sequences serve two purposes. In the first, as DNA, they code for amino acids. In the second, as RNA, they serve as regulatory sites for splicing proteins. And this explains why synonymous substitutions, ones that don't change the encoded protein, can disrupt splicing.

Now, there are well-known examples of coding mutations that disrupt splicing enhancers and cause an exon to be excluded from a transcript. This happens in cystic fibrosis, in breast cancer, and in spinal muscular atrophy. And often the consequence of a skipped exon is a change in the open reading frame, leading to a loss of the protein's function. So think about it: if this happens in a master regulatory factor, the regulation of all its downstream targets would also be altered. Altered splicing also happens in the nuclear lamina protein lamin A, causing partial skipping of exon 11. And can you guess the outcome? The altered structure of lamin A changes its protein function, along with the nuclear architecture, leading to the impaired physiological functions seen in progeria. This is how a single nucleotide change can affect the course of a lifetime.

Oh, a question about finding the splicing regulators. Very good question. So the question was: is there a resource for finding these regulatory elements? Right, that's the question. Your question addresses a current genomic challenge, which is that the landscape of these ESEs and ESSs, as well as the ISEs and ISSs, is not well mapped. The consequences of mutations in these elements are also not easily predicted. And even worse, the extensive amount of non-coding space to search is prohibitive. So much more work is needed in this area. But this is one of my favorite areas, and my group has developed a tool called Skippy to predict mutations that are likely to cause exon skipping. It's available on the NHGRI website, and I welcome you to go and look at it. It has a very easy user interface: you simply need a chromosome number, a chromosome coordinate, and the alleles involved in the change. Again, it's only a predictive tool, and any predictions need to be experimentally validated. This tool was built on the knowledge we gained from comparing the characteristics of known exon-skipping variants to those of neutral substitutions that don't affect splicing. And using this tool, we've identified new synonymous substitutions that dramatically affect splicing in disease genes. Keep in mind, though, that this tool only addresses mutations in coding regions. Mutations that fall in intergenic or intronic regions are still an open question. That's correct, for those you won't get an answer.
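(To give a feel for the kind of feature such predictors weigh, and this is not Skippy itself, just an illustration, here is a toy check of whether a substitution destroys putative ESE hexamers or creates ESS hexamers. The motif sets are tiny stand-ins; real tools use curated hexamer tables.)

```python
# Toy splicing-impact feature: ESE hexamers lost and ESS hexamers gained
# when a reference exon sequence acquires a substitution.
ESE = {"GAAGAA", "CAGAAG"}          # illustrative enhancer hexamers
ESS = {"TAGGGT", "GGGTTT"}          # illustrative silencer hexamers

def hexamers(seq):
    return {seq[i:i + 6] for i in range(len(seq) - 5)}

def splice_impact(ref_exon, alt_exon):
    ref_h, alt_h = hexamers(ref_exon), hexamers(alt_exon)
    return {"ESE_lost":   len(ESE & (ref_h - alt_h)),
            "ESS_gained": len(ESS & (alt_h - ref_h))}

# A single A->T substitution in an invented exon fragment
print(splice_impact("TTCAGAAGAATGA", "TTCAGTAGAATGA"))
# -> {'ESE_lost': 2, 'ESS_gained': 0}
```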
There are other tools out there that map the binding matrices of these splicing proteins. So you could put in your intronic sequence and get a prediction of which proteins recognize it, then put in your mutant sequence and see whether those combinations of proteins have changed.

So there's another connection in the genome, and I don't know if you've ever made this connection. It's that the length of DNA associated with a nucleosome, about 147 base pairs, is nearly the exact length of an average exon. And the current idea is that exons may have evolved to this size to facilitate their identification and splicing. In this case, the nucleosomes serve as speed bumps, slowing the RNA polymerase as it crosses the exon and buying time for the splicing factors to work.

So what's the prevailing connection between functional elements in the cell? Is it that aging cells sabotage each other? That non-coding RNA orchestrates many events? Or that conserved elements underlie all important features? The answer is non-coding RNA, so let's talk about non-coding RNA. Although the conventional view of gene regulation invokes the central dogma of DNA to RNA to protein, much of the RNA in a cell is never translated. This RNA is actually very important, and evidence from genomic studies indicates that organismal complexity is largely due to the regulatory functions contributed by non-coding transcripts.

So, for example, lincRNAs are transcribed regions greater than 200 nucleotides in length. They are often poorly conserved, suggesting they might have lineage-specific functions. And they regulate gene expression by diverse mechanisms that are not yet fully understood. It's also possible that they are conserved among species in their functions and their secondary structures, but not at the level of the nucleotide sequence. Tens of thousands of examples are thought to exist in the genome, and recently four roles for lincRNAs have been proposed. First, they serve as guides: lincRNAs can recruit chromatin-modifying enzymes to target genes, either in cis or in trans. Second, they serve as signals: lincRNAs can indicate gene regulation as it happens. Third, they serve as decoys: lincRNAs can titrate away transcription factors or act as decoy targets for binding proteins. And fourth, they serve as scaffolds: lincRNAs bring together multiple proteins to form RNA-protein complexes that direct histone modifications. Given these important roles in the cell, it is not surprising that misregulation of lincRNAs is associated with disease. Studies from the lab of John Rinn are identifying candidate oncogenic lincRNAs and tumor suppressor lincRNAs.

Now, like lincRNAs, microRNAs are a class of non-coding RNA, but of much smaller size. These transcripts are generated as roughly 70-base RNA hairpins from within spliced introns or from non-coding regions of the genome. The hairpin structure is processed through the action of specialized enzymes to create a roughly 22-base message that targets the 3' UTR of a messenger RNA and either causes degradation of the transcript or interferes with its translation. Over 1,000 microRNAs have been identified, and their specific expression patterns are now being used as diagnostics in cancers.
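(Target recognition by that roughly 22-base message can be sketched directly: canonical sites in a 3' UTR match the reverse complement of the microRNA "seed," bases 2 through 8. Real predictors such as TargetScan layer conservation and context scoring on top of this; the UTR below is made up, and the microRNA is a let-7-like sequence.)

```python
# Sketch of miRNA target-site scanning: find 7-base seed matches in a 3' UTR.
COMP = str.maketrans("ACGU", "UGCA")               # RNA base-pairing partners

def seed_sites(mirna, utr):
    seed = mirna[1:8]                              # seed = bases 2-8
    site = seed.translate(COMP)[::-1]              # reverse complement in UTR
    return [i for i in range(len(utr) - 6) if utr[i:i + 7] == site]

# let-7-like miRNA scanned against an invented UTR fragment
print(seed_sites("UGAGGUAGUAGGUUGUAUAGUU", "AAACUACCUCAAAACUACCUCA"))
# -> [3, 14]: two candidate target sites
```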
So the fourth area in which the Human Genome Project has made significant contributions is our ability to read the epigenetic code. Oh, that's a very good question. So the statement was that many genes are targets of the same microRNA, and each microRNA targets many genes. Yes, there are an enormous number of interconnecting networks regulated by microRNAs. As for how many genes are targeted by microRNAs, probably the vast majority of them, but we don't actually know the precise answer, and that's partly because of the way microRNAs act. They have very specific recognition patterns, typically in the 3' UTRs, though there are some variations on this theme, and some new target sites are being found within the bodies of genes. So until we know what all the target sites are, it's going to be pretty difficult to answer that question. But if you're thinking in terms of regulatory networks, and you're thinking in terms of transcription factor binding sites, you should also be thinking in terms of microRNA target sites. I think that's quite reasonable. And there are tools out there to predict microRNA binding sites. Some of them use sequence conservation at the target sites; others use the precise patterns recognized by the microRNA through sequence complementarity. And remember that microRNAs start out double-stranded, in their hairpin structure, and sometimes each strand can specify a different target.

So within the last five years, we've all lived through a monumental shift, and I think you've probably recognized it: we've gone from a gene-centric paradigm to a genome-scale perspective. In her book The Immortal Life of Henrietta Lacks, Rebecca Skloot wrote that good science is all about following the data as it shows up, letting yourself be proven wrong, and letting everything change while you're working on it. And this is actually very true. In fact, ChIP experiments are not classified as hypothesis-driven science; they are hypothesis-generating science. And they are changing the face of our biological understanding by enabling new paradigms to be described. While I'm not showing them all here, there are over 150 different epigenetic marks catalogued, and the question remains: how many are necessary to describe the chromatin code? The most extensive study I've seen defined 51 chromatin states. It subdivided the marks into combinations and, based on their statistical occurrence, defined regions as being promoter-associated, enhancer-associated, or repressed. These results are striking, because they indicate the high combinatorial complexity and the information content possible within epigenetic patterns. And of course, they give us the ability to annotate functional elements in the genome. But what's so exciting is that they give a readout of the dynamic state of those elements.

So how do you tell a promoter from an enhancer? Active mammalian promoters typically have histone H3 lysine 4 mono-, di-, and tri-methylation at the transcription start site. They also have DNase hypersensitivity, the presence of RNA Pol II, and, typically, a CpG island. There are also specialized histones and histone acetylation marks. As we move away from promoters, we find that p300 is specifically localized to enhancers. A recent ChIP study found over 55,000 candidate enhancers bound by p300. These enhancers had H3K4 mono-methylation, and they were depleted for H3K4 tri-methylation. So this is a big difference between enhancers and promoters: H3K4 tri-methylation.
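(That discriminator can be written as a one-line rule of thumb. The signal values and the two-fold cutoff below are placeholders for illustration, not from any published pipeline.)

```python
# Rule-of-thumb call from the two marks just described: enhancers are
# H3K4me1-high / H3K4me3-low, and active promoters are the reverse.
def promoter_or_enhancer(h3k4me1, h3k4me3):
    if h3k4me3 > 2 * h3k4me1:
        return "candidate promoter"
    if h3k4me1 > 2 * h3k4me3:
        return "candidate enhancer"
    return "ambiguous"

print(promoter_or_enhancer(h3k4me1=9.0, h3k4me3=1.2))   # -> candidate enhancer
```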
In contrast to active promoters, repressed promoters have specific identifying marks as well. These include H3K27 tri-methylation, which is the prototypical mark of Polycomb group repression, and H3K9 tri-methylation, which correlates with constitutive heterochromatin and DNA methylation. Now, Polycomb proteins are important in negative regulation, and their overexpression can cause disease. For example, Polycomb proteins are upregulated in some cancers, such as melanoma, lymphoma, and breast and prostate cancer. In those cases, it is possible that their excessive repression causes the de-differentiation of cancer cells, and one solution that has been proposed is to interfere with Polycomb function as a therapy.

Is Polycomb always a sign of repression? Typically, yes, because Polycomb proteins are transcriptional repressors, showing co-localization with the H3K27 tri-methylation mark. But at about 20% of genes in undifferentiated ES cells, a strange phenomenon occurs: the activating mark H3K4 tri-methylation occurs together with Polycomb repression, and these loci are therefore described as bivalent. Although such genes are largely inactive in undifferentiated ES cells, they are marked for possible future activation, and they remain flexible, contingent on the needs of the differentiated cell. Taking this back to disease, some tumors carry these repressive marks at loci where they shouldn't. In other words, the loci have never been activated; they've never lost the repressive marks. And the hypothesis is that the repressive marks have been retained abnormally, preventing the cell from attaining its differentiated identity. Some have speculated that these cells represent products of stem cells, known as tumor stem cells.

The histone code extends into the body of transcribed genes, where H3K36 tri-methylation and H3K79 di- and tri-methylation are elevated. There's also mono-methylation at several histone tails, as well as extensive acetylation throughout the gene body. All of these signals can be used to identify the collection of actively transcribed genes within a cell and to define that cell type. What's really exciting is that the more precisely these signals become understood in the promoter and the gene body, the more predictive power we have regarding the level of gene expression. In other words, we are learning not only to read the code, we are learning to interpret its meaning.

One important protein that is central to most regulatory processes is CTCF. It functions at tens of thousands of sites in the genome as an insulator and a boundary element, and it is heavily involved in regulation at the CFTR locus. If you look at the linear view of the CFTR locus, it looks completely different from the three-dimensional view. In the three-dimensional view, CTCF supports and stabilizes long-range DNA interactions that form loops, by interacting with additional proteins such as cohesin. These are the types of interactions being identified by the looping assays, the 3C, 4C, and 5C techniques and Hi-C. When you see this, it's easy to understand why CTCF is so important: it partitions the genome into distinct regions of open and closed regulatory domains.

So although we've focused mostly on histone modifications today, another aspect of the epigenetic code is conferred by DNA methylation patterns. Like the histone code, these are specific to a tissue type. They stably alter gene expression patterns, frequently by causing gene silencing, but not always.
DNA methylation marks suppress the expression of integrated viral genes, they prevent genomic rearrangements caused by repetitive elements, and they play a crucial role in the development of many types of cancer.

So let me finish up by showing you a comparison of tumor and normal samples generated in my lab. Here's a picture of DNA methylation displayed as a heat map. The blue color indicates no methylation, and the yellow represents strong methylation. There are about 500 promoter regions on the x-axis and about 50 different samples on the y-axis. There are two types of normal samples shown here, uterine and fallopian tube. And you'll see that the tumors of the uterus, down here, look very different in their methylation patterns from the normal uterine tissues. You'll also see that these sites look very similar in ovarian tumors of the same morphological subtype, which in turn look very different from their normal comparison, which is actually to normal fallopian tube samples. Notice also that in the normal samples, CpG islands are not methylated, which is very typical, while non-CpG-island regions carry methylation marks. Those patterns almost completely flip in the tumors, where there is strong methylation in the CpG islands and a loss of methylation in the non-CpG-island regions. What we hope is that the genes with heavy methylation will point us to a common pathway or enzyme that is responsible for this targeted methylation.

That's correct, it will vary. So the question is: can we derive a common methylation pattern for all ovarian tumors? In the case of ovarian tumors, the answer will be no, because there are multiple subtypes, and each subtype will have a different methylation pattern. You can actually distinguish the subtypes because they have morphological differences, and from all of the data that I've seen, those morphological differences correspond to differences in methylation patterns, just as they correspond to differences in gene expression patterns.

So I told you that the naysayers were wrong about the importance of genomic sequencing, and I hope I've convinced you. The human genome sequence has advanced our understanding of evolutionary diversity among species. Furthermore, it is essential to figuring out how genomes function and what the important elements are, and it drives research into variants that disrupt function in complex disease. Predictions about the future of medicine suggest that we'll all have a sequenced genome in the next decade, perhaps on a smart card that we carry around. Knowledge of the variants that create subtle changes in our gene expression patterns could help doctors predict how we'll respond to drugs and what diseases to vigilantly screen for. And just think, it could even help us pick our significant others. With that, I'd like to end. Thank you for your attention. Please let me know if you have any questions.