 Good morning. Can you hear me? Thank you very much for that wonderful introduction, and thank you very much for joining me here today to talk about regulatory and epigenetic landscapes. I'll start by saying I have no commercial affiliations to report. I want to start with this quote by Carl Sagan, where he says, the nitrogen in our DNA, the calcium in our teeth, the iron in our blood, the carbon in our apple pies were made in the interiors of collapsing stars. We are star stuff. Just as the universe is the focal point of astronomy, the human genome is the focal point of biology. It represents a universe to biologists. The physical universe is vast and unexplored. By comparison, the genome is vast, yet invisible, and after 60 years of exploration and full sequencing, it retains secrets that haven't been explained. In 2003, the Human Genome Project laid the foundation for the future of medicine. As scientists directly involved in disease research, you and I are likely to use genomic sequence data every day. And yet, how likely are we to see a piece of genomic sequence and know its function? Or if it contains variants that contribute to disease? These are the questions facing biomedical research today. Now that the genomic blueprint has been generated, biologists need to characterize genome composition to understand the inner workings of the genome. They need to assess the impact of sequence alterations and address the biological and environmental causes of disease. We can begin to illuminate the relevant components of the genome by discussing functional elements, histone modifications, DNA methylation, as this outline enumerates. And the question of what's next belongs to each and every one of you. As you pursue important biological questions with relevance to health and human disease. Think about it this way. If Facebook can get 1 billion hits in 24 hours, how far off can the estimate of 1 billion genomes be? Age of omics. 
Entire collections of functional elements can be examined comprehensively, through computational and experimental means. You've heard the catchphrases genomics, proteomics, transcriptomics, epigenomics. What this slide lacks is emphasis on the critical need for integration across datasets, cell types, and time points. Otherwise, these snapshots of the cell only give a static view of a dynamic system. In fact, much power can be gained by comparing changes in large-scale data to derive the action and response system of the cell. So let's start with genome composition. Winston Churchill has a quote where he says, "Now this is not the end. It is not even the beginning of the end, but it is perhaps the end of the beginning." This is particularly appropriate for studies of the human genome. We have the sequence of 3.2 billion haploid base pairs, 22,000 protein coding genes, and we know that the genome is 98% non-coding sequence. Clearly, we cannot ignore the 98%. Repetitive elements comprise 48%, and another 50% is unknown. Determining what these unknown sequences do and how they contribute to disease is what makes our outlook as biologists so fascinating. It's no secret that regulatory regions outnumber protein coding regions. And in this talk, I will show you that regulatory sequence can outnumber coding sequence by four or five times. But predicting regulatory content and finding regulatory elements are two different challenges. These mysterious sequences have been referred to as the dark matter of the genome, analogous to the unseen matter and energy that inspired Carl Sagan in his study of the universe. Let me explain what I mean. We like to think that the human species is the most advanced species on the planet. This suggests that genomic complexity has been critical to our success. Yet the origin of that complexity is not obvious from studying the human genome itself. 
For example, if you thought that the number of bases in the genome was significant, then I'm sorry to tell you that by that measure of complexity, the person sitting next to you falls between a turkey and a lungfish. And in terms of the number of genes in humans, we fall somewhere between a chicken and a grape. So while the complexity isn't controlled by total sequence or gene number alone, it is now up to genomic research to determine how the dark matter makes us special. One place to see complexity is in disease studies. For example, diseases are monogenic or multigenic. Somewhat amazingly, we know the mutations behind 4100 Mendelian or single gene diseases. These diseases have a devastatingly simple origin whereby a single genetic misspelling cripples a protein. Finding the faulty gene means that you've instantly revealed the biological story of the disease. But very few of these disease mutations are found outside of coding regions. A few examples are starting to appear in enhancers and other non-coding regions in small numbers. In contrast, the genetics of complex diseases caused by the effects of multiple genes, lifestyle and environmental factors currently evades description. And with contributions from hundreds of genes, the risk for disease is only slightly elevated by any given variant. Therefore, understanding what these variants do and how they may interact to cause disease is the goal of current research. Since transcription of protein coding genes is required for their function, you might hypothesize that any transcribed region could carry function. A study of transcription across the genome, comprehensively covering over 100 cell types, indicated that the total amount of the genome with evidence of transcription was 70 to 80 percent. So while we don't know what the function of that transcription might be or whether it has any function, it sheds new light that the dark matter of the genome is largely transcribed. 
This means that even the darkest recesses are shuttled through the dynamically active system. This suggests that the complexity of the human genome could easily be hidden in the recesses of the dark matter that occupies each cell type. Yet it's not the totality of transcription that's important, but the diversity of products that are produced: diverse cell types, diverse regulatory elements, alternative splicing, alternative promoters and non-coding RNA. So perhaps the adage that one man's junk is another man's treasure applies to individual cells, where human complexity could be born. Is it possible that one cell's junk DNA is another cell's dark matter? We know that mutations cannot be tolerated when they interrupt functional elements, and we know that sequence variants can be tolerated because they are found in one out of every 100 to 300 bases. Interpreting the functional impact of sequence changes depends on your perspective. It requires you to think about the genome in a cellular context. You are familiar with the linear genomic sequence illustrated in every textbook on molecular biology. This consecutive nucleotide display conveys the content of the genomic sequence, but overlooks the necessary packaging. We often describe DNA folding as a convenience to cram nearly six and a half feet of linear DNA into a small sphere. This feat requires tremendous compaction. But we don't as often hear about how the architecture of the folding facilitates interactions along the DNA. These interactions can join regulatory elements and gene targets that are one million bases apart. Like strands of a tied shoelace, loops bring distant genes and gene control switches into close proximity. Consider then DNA wrapped up into a folded structure in order to allow interchromosomal contacts. Suddenly the positioning of chromosomes is relevant to their functional instructions. 
Sequences at the periphery of the nucleus are transcriptionally silent and sequences near the interior are transcriptionally active. What we fervently want to know is how does this cellular topology affect diseases and how do diseases change this topology? So as you think about the genome and you explore the impact of a single nucleotide variant, ask yourself, is it affecting the linear flow of information, the looping interactions of gene regulation or the 3D spatial positioning and dynamic responsiveness of the chromosome? Since the 1950s, when Francis Crick described the central dogma, a laser-like focus has been placed on finding mutations that affect protein function. Now with whole exome sequencing, these are rapidly being hunted to completion. And in the last decade, the recognition of the vast amount of non-coding RNA started to expand the scope of functional elements in the genome. We know now that as much as 50 to 90 percent of the transcriptome has no protein coding potential, but rather represents an important class of regulatory molecules responsible for orchestrating gene expression. This opens up a new search for disease-causing mutations that are present in the non-coding sequences, a much larger target that requires whole genome sequencing to address. So returning to the issue of complexity in the human genome, it has been argued that the extent of long non-coding RNA is correlated with organismal complexity, especially in processes related to brain function. Then it is no surprise that lincRNAs have roles in cancer, diabetes, hypertension, as well as psychiatric disorders, including schizophrenia, bipolar disorder, depression, and autism. Let's begin with the linear sequence. One way to narrow the search for meaningful pieces of the non-coding genome is to align genomic sequences from species that are very different. 
For example, the 46 existing genomes, from species ranging from humans to fish, span an evolutionary time of 450 million years. Clues to function are found in regions that stay the same between species, as well as those that change. For example, relationships can be depicted using a phylogenetic tree. Written on the tree are important questions that can be addressed by particular comparisons, such as the origin of therian sex chromosomes X and Y. To address this question, you would look at differences in placental mammals and marsupials versus monotremes. The question of changes accompanying pig domestication is focused on comparisons of wild and domestic pigs. And human language evolution involves comparison of Neanderthals, Denisovans, and modern humans. Transposable element exaptation, or co-option into functional regulatory elements, started with comparisons of chickens and lizards and expanded to show that 10% of transcription factor binding sites in the human genome were deposited by mobile elements. So for each of these questions, the answer requires having sequences separated by the appropriate phylogenetic time span. Now looking directly at sequence alignments in mammals provides clues to regulatory elements. Here are five species aligned to show the regulatory elements of the beta-globin enhancer region. Boxed areas represent known or predicted transcription factor binding sites. Dots represent identical sequences across species and dashes represent gaps. This computational exercise of finding regulatory conservation is called a phylogenetic footprint. The similarity is caused by negative selection, otherwise known as purifying selection. This is the pressure to prevent or remove deleterious changes from occurring within a sequence. It is a strong signal of regions whose sequence has been maintained over time because it is critical for a biological function. This image illustrates how sequence conservation represents functional regions. 
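To make the footprinting idea concrete, here is a minimal Python sketch of how one might flag fully conserved stretches in a multiple alignment. It is only an illustration under simple assumptions: the sequences are already aligned to equal length, a dash marks a gap, and "conserved" means every species carries the identical base. The function names `conserved_columns` and `footprints` and the `min_len` threshold are hypothetical, and real footprinting tools use statistical models of substitution rather than strict identity.

```python
def conserved_columns(alignment):
    """Return one boolean per alignment column: True when every
    sequence carries the same base and none of them has a gap."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    flags = []
    for column in zip(*alignment):
        bases = set(column)
        flags.append(len(bases) == 1 and "-" not in bases)
    return flags


def footprints(alignment, min_len=6):
    """Return half-open (start, end) intervals of consecutive conserved
    columns that are at least min_len columns long."""
    runs = []
    start = None
    # A trailing False closes any run that reaches the end of the alignment.
    for i, conserved in enumerate(conserved_columns(alignment) + [False]):
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


if __name__ == "__main__":
    # Toy three-species alignment, invented for illustration only.
    toy = [
        "ACGTTGACTCAGGA",
        "ACCTTGACTCAGCA",
        "AC-TTGACTCAGGA",
    ]
    print(footprints(toy, min_len=6))
```

In this toy alignment the only run that passes the length threshold sits between the mismatched column near the start and the mismatched column near the end, which is the kind of block the boxed regions in an alignment figure would highlight.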
Now at closer distances like in primates, comparative approaches might reveal that functional and neutrally evolving sequences remain too similar to be distinguished. But functional words can be found in regions where no changes occur, as shown here by the black lines, as long as enough primate species are included. These words often reveal primate specific regulatory elements. You'll note that the boxes show regions where the sequences have changed. This approach is known as phylogenetic shadowing, and it can be used to study any differences within any lineage, as long as enough sequences are available. Now in the early days of genome sequencing, only human and mouse genomes were available. They were used to calculate the total amount of the human genome under purifying selection. This could be used as a proxy for how much of the genome was functionally important. In total, 40% of the human genome aligned with the mouse genome, as shown by the blue curve. The vast majority of this was neutrally evolving, as shown by the red curve, and carried low sequence similarity. The statistical estimate of the sequence conserved above the neutral rate was 5%, as shown by the gray curve. Now keep in mind, this is at least two times larger than the estimate of coding sequence in the genome. And as conserved regions became the focus of function, this approach could say that there were regions in the genome that appear to be functional, but it wasn't proof of function and it certainly couldn't say what those functions were. Now some very extreme examples of negative selection exist in the genome. For example, ultra conserved elements are longer than 200 base pairs and have 100% identity with no insertions or deletions between human, mouse, and rat genomes. They were originally thought to be traces of contamination by human DNA in these other genomes. It was later shown that these sequences actually belong in these genomes. 
This extreme level of conservation strongly predicts functional importance. And today we know that ultraconserved elements throughout the genome can serve as enhancers, regulators of splicing, as well as exons of protein coding genes. What's really interesting about ultraconserved elements is that some of them carry a dual regulatory purpose. They are exons in protein coding genes and they are enhancers for neighboring genes. So the search for functional elements required new exploration of the genome beyond the coding regions. Donald Rumsfeld has a quote: "There are known knowns. These are the things we know. There are known unknowns. These are the things we do not know. There are also unknown unknowns. These are the ones we don't know we don't know. It is the latter category that tends to be the difficult ones." This is not a picture of Donald Rumsfeld. This is a picture of Captain Obvious. But the question of what fraction of the human genome is under purifying selection has remained contentious ever since those first estimates in 2002. At that time it was not well appreciated that the amount of human constrained sequence that is also constrained in mouse is a minority of all human constrained sequence. This is because there has been relatively rapid gain and loss of functional sequences in both lineages since their last common ancestor. In other words, at that time it was an unknown unknown. By examining genomes of multiple species, the constrained sequence could more easily be seen despite losses in some species. And that's what's being shown here. The saturated colors show us the fraction of constrained sequence that has been retained, whereas the pastel colors show how much has turned over. Based on this study, the amount of the genome under negative selection to preserve function is estimated at 8.2 percent. This is nearly four times the amount of protein coding sequence. 
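To put these percentages on a common scale, here is a small back-of-the-envelope calculation in Python. It uses only figures quoted in this talk: a 3.2 billion base pair haploid genome, 40 percent aligning with mouse, 5 percent conserved in the early human-mouse estimate, and 8.2 percent constrained in the multi-species estimate. The 2 percent coding fraction is an assumption taken from the "98% non-coding" figure, and the helper name `fraction_to_bp` is hypothetical.

```python
GENOME_BP = 3_200_000_000  # haploid genome size quoted in the talk


def fraction_to_bp(fraction, genome_bp=GENOME_BP):
    """Convert a fraction of the genome into a whole number of base pairs."""
    return round(fraction * genome_bp)


# Fractions quoted in the talk; the coding value is an assumed ~2%,
# derived from the statement that the genome is 98% non-coding.
ESTIMATES = {
    "aligned with mouse": 0.40,
    "conserved, human-mouse estimate": 0.05,
    "constrained, multi-species estimate": 0.082,
    "protein coding (assumed)": 0.02,
}

if __name__ == "__main__":
    for label, fraction in ESTIMATES.items():
        print(f"{label}: {fraction_to_bp(fraction):,} bp")
    ratio = (ESTIMATES["constrained, multi-species estimate"]
             / ESTIMATES["protein coding (assumed)"])
    print(f"constrained / coding ratio: {ratio:.1f}")
```

Under the assumed 2 percent coding fraction the ratio comes out at roughly four, in line with the comparison above; a slightly different coding estimate would shift it a little in either direction.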
At the opposite end of the spectrum, some regions are under selection to increase the number of nucleotide changes in a particular species. This is known as positive natural selection, the force that drives the increase in advantageous substitutions. A well-known example, although you might not consider this advantageous, is the sickle cell mutation in the beta-globin gene. In its heterozygous form it actually protects against malaria. And that's because infected blood cells are greatly weakened by the presence of the sickle allele, causing them to be rapidly cleared from the body. This connection explains why the mutation has been retained and why the incidence of sickle cell corresponds to regions endemic with malaria. Another example is a mutation in a regulatory region near the gene for lactase. For around 10 percent of Americans, 10 percent of Africa's Tutsi tribe, 50 percent of Spanish and French people, and 99 percent of Chinese, a tall cold glass of milk means an upset stomach. Most of the adults in the world are lactose intolerant and cannot digest lactose, the primary sugar in milk. And yet regardless of our ancestry, most of us begin our lives happily drinking milk. So what happened in that intervening time? In people who become lactose intolerant adults, the lactase gene is switched off after weaning. Recent evidence suggests that as African and European populations began herding cattle, lactose tolerance became an advantageous trait. These mutations were favored by natural selection and quickly spread through dairy-dependent populations. And this is what allows us to enjoy our daily lattes. Now rapidly changing regions in the human genome provide a window on acquired mutations that provide a favorable trait. In 2006 a large-scale study was undertaken to identify all transcripts with accelerated divergence, called human accelerated regions, or HARs for short. Up to 1,000 regions have been identified. 
They are conserved across species and some have accelerated changes in human relative even to chimp. One example of an accelerated change occurs in a special enhancer sequence that targets expression to the opposable thumb. It may also be responsible for modifications in the ankle or foot that correspond to walking upright. What this figure shows is that when the human sequences were placed in a transgenic mouse, they could drive expression that supported the hypothesis of the opposable thumb. When the mouse sequences were used, they could not support those expression patterns. So given all the non-coding sequence under selection, perhaps it is even more impressive that many years ago, in 1975, Mary-Claire King published a paper that stated that the evolutionary changes in human anatomy and way of life are more often based on changes in the mechanisms controlling the expression of genes than on sequence changes in proteins. How prophetic that statement would turn out to be. And with it we continue exploring the non-coding dark matter of the genome. So Winston Churchill has another quote: "Out of intense complexities, intense simplicities emerge." I've just told you how complex the genome can be. Could it be possible to describe it in simplistic terms? That's what we would all like, right? I guess my answer is it depends on how you define simplistic. In other words, the rules and patterns define ways to reduce complex items into straightforward examples. Following the completion of the genome project, the linear sequence could be used to identify these rules and define regulatory elements. From there, experimental techniques and computational approaches have been used to predict functional regions. The basic descriptions of regulatory elements and the mapping of proteins that bind to them help us to define these rules. We can define enhancers as clusters of transcription factor binding sites that are each 6 to 20 base pairs long. 
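The definition of an enhancer as a cluster of short binding sites lends itself to a simple computational sketch. The Python below is a minimal illustration, not a production method: it assumes exact string matches stand in for binding sites, whereas real predictions typically score position weight matrices. The motif table, the function names `find_sites` and `cluster_sites`, and the `max_gap` and `min_sites` thresholds are all hypothetical choices made for the example.

```python
def find_sites(sequence, motifs):
    """Return a sorted list of (start, end, name) for every exact motif
    match in the sequence. Motifs must be 6 to 20 bases long."""
    sites = []
    for name, motif in motifs.items():
        if not 6 <= len(motif) <= 20:
            raise ValueError(f"motif {name!r} is outside the 6-20 bp range")
        start = sequence.find(motif)
        while start != -1:
            sites.append((start, start + len(motif), name))
            start = sequence.find(motif, start + 1)  # allow overlapping hits
    return sorted(sites)


def cluster_sites(sites, max_gap=50, min_sites=3):
    """Group sorted sites into clusters: a site joins the current cluster
    when it starts within max_gap bases of the previous site's end.
    Only clusters with at least min_sites sites are returned."""
    clusters = []
    current = []
    for site in sites:
        if current and site[0] - current[-1][1] > max_gap:
            if len(current) >= min_sites:
                clusters.append(current)
            current = []
        current.append(site)
    if len(current) >= min_sites:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    # Toy motifs and sequence, invented for illustration only.
    motifs = {"GATA": "AGATAA", "EBOX": "CACGTG", "AP1": "TGACTCA"}
    sequence = "TTAGATAACCCACGTGGGTGACTCA" + "A" * 100 + "AGATAA"
    for cluster in cluster_sites(find_sites(sequence, motifs)):
        print(cluster[0][0], cluster[-1][1], [name for _, _, name in cluster])
```

In the toy sequence, three sites sit close together and form one candidate cluster, while the isolated site 100 bases downstream is left out. The gap and site-count thresholds are where most of the judgment lies, and sensible values would depend on the data.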
Evolutionary sequence conservation is typically present and enhancers are being found just about anywhere. They are adjacent to genes, embedded within coding exons of genes, in introns of unrelated genes, megabases away from their targets and even in gene deserts. Their modular construction, being made up of transcription factor binding sites, could allow the accumulation of mutations or even variants that carry a modest effect on disease risk. Why these changes persist in the genome and how they contribute to type 2 diabetes, colorectal cancers and even coronary artery disease are very fascinating questions. Promoters are the regulatory switches controlling gene expression. They are bound by the RNA Pol II transcription machinery. They can be described in terms of having canonical and non-canonical elements. All active promoters have nucleosome-free regions to provide clear access for the transcriptional machinery. Not all of these elements are present in all vertebrate promoters. Some protein coding and microRNA promoters have these canonical core promoter elements. But a large majority of human promoters, or mammalian promoters, lack these classical core promoter elements. Instead they contain broad regions such as CpG islands or ATG deserts or transcription initiation platforms. One important protein that is central to most regulatory processes is CTCF. It is an 11-zinc-finger protein that functions as an insulator or boundary element. To fulfill these functions there are up to 40,000 CTCF binding sites in the human genome. CTCF partitions the genome into distinct regions of open and closed chromatin domains, like fences in a neighborhood. It has recently been shown to bind in tandem to mark the boundaries of chromatin domains and it also functions to insulate enhancers from spurious interactions with unintended promoters. 
Variants that disrupt CTCF binding alter its normal function in the cell, but usually at only one of the two alleles, leading to variable consequences. And you can begin to see how changes around the genome in CTCF binding locations could have complex outcomes. Now there are also regulatory regions that affect splicing. A portion of these disrupt events that happen directly in the splice junctions, and we all know about these. They change the splice boundary information. Others reveal hidden regulatory elements, and what I mean by hidden is that they are embedded within coding sequences. Splicing instructions in central positions of exons contain specialized elements known as exonic splicing enhancers and exonic splicing silencers. These are the ESE and ESS positions in the image. These are binding sites for RNA-binding proteins. So in this case even synonymous substitutions in the DNA can affect regulatory function in the RNA. A similar set of regulatory sites falls in introns and acts as intronic splicing enhancers and intronic splicing silencers. So I've just told you that introns carry functional elements. Luckily some of these elements look very much like transcription factor binding sites, and they can be conserved, and that gives us some clues to how to predict them. So what have we learned about enhancers? Recently two groups published work on enhancers that are exquisitely long. They call them super enhancers or stretch enhancers. They're so long they look like landing strips at an airport. They're only found upstream of genes that are critical for forming a cell's identity. We find that genetic variations associated with common disease are highly enriched in stretch enhancers. For example, stretch enhancers that are specific to pancreatic islets harbor variants linked to type 2 diabetes. Also notably, this work was published by our own Francis Collins. We also recently learned about enhancer RNA. 
These are RNA that are expressed as very short pieces of non-coding RNA whose transcription is coupled to a partner protein coding gene. If transcription is lost at the enhancer, it is reduced or lost at the protein coding gene. This actually provides a way to predict enhancers using small RNA sequencing. Now while the linear sequence of the genome is very important, it gives the impression that disconnected pieces of the genome function in isolation, when in fact genomic function is much more like a game of Twister where many distant pieces show connections. So consider the linear enhancer view that I showed you in a previous slide. This enhancer actually functions by looping interactions mediated by the protein CTCF and an accessory protein, cohesin. Enhancer A interacts with the promoter to give specialized limb expression and enhancer B interacts with the same promoter to give specialized brain expression. And it's the combination of both enhancers that confers the brain and limb expression patterns seen in the gene expression profile at the top of the image. Enhancer variants contribute to gene expression variability by precluding the proper interactions. In this way only one of the productive interactions could occur in a cell, creating an allele-specific imbalance. This variability can be very important to the generation of complex diseases, for example Crohn's disease or cardiovascular disease, where fine tuning of the amount of products may be important. Now a very strong effect is seen in mutations that affect the sonic hedgehog gene enhancer. This is a gene involved in the formation of digits in mammals and fins in fish. Mutations occur in an enhancer located one megabase away from the gene. These mutations allow unregulated activity of the enhancer during development, which creates extra digits. The polydactyly phenotype occurs in humans. 
It can be reproduced in laboratory mice and it is also seen in Hemingway's cats, which are selectively bred at the Hemingway House in Key West, Florida, if you'd like to go see them. Now in addition to cute cats, genomic looping plays an important role in the function of non-coding RNA. The Xist non-coding RNA is expressed from one chromosome in every female cell and it coats that chromosome to repress expression in a process called X chromosome inactivation. This image shows two superimposed plots where the distance from the Xist expression site, shown in the black box, is drawn along the X axis. The peaks in red are locations of the Xist non-coding RNA along the chromosome. The peaks in blue co-localize and represent the frequency of interaction of those positions of the chromosome with the Xist expression site. So in this way Xist is delivered to its target sites without ever diffusing through the nucleus. Many other non-coding RNA are delivered through strategies that use this type of targeted proximity rather than diffusion. This allows for very low levels of non-coding RNA transcription to be effective as regulatory molecules. I think this is a very important point because people say, well, RNA expressed at such a low level probably is noise. This is one example where it's not noise, and in fact Xist is needed at very high copies, but many other non-coding RNA probably are effective at very low expression levels. So if genes have non-random location, then a corollary is that chromosomes facilitate interactions through their nuclear organization as well. Chromosome territories are the spatial locations defining the positions of each chromosome. Regions with transcriptionally silent heterochromatin contain gene-poor sequences and low overall transcriptional activity. They are pushed to the nuclear periphery. Gene-rich regions of highly transcribed DNA are situated in the center of the nucleus. When a gene loses function it often changes its location in the nucleus. 
Now in newer views of chromosome packaging, chromatin from different chromosomes is allowed to expand into the surrounding territories. These are known as interchromosomal networks. You can think of this as an explanation for translocations that occur between chromosomes. Double-strand breaks formed in regions of intermingling are more likely to produce interchromosomal rearrangements, whereas those deep within a territory are more likely to produce intrachromosomal rearrangements. At the edges of the nuclear periphery are the regulators of structure. In this silenced region important proteins include the lamin proteins. They form an intranuclear scaffold known as the lamina around the edges of the nucleus. This lamina connects the chromatin, or the heterochromatin, to the nuclear envelope. These proteins support the nuclear architecture by separating regions of silenced chromatin from active chromatin. Hutchinson-Gilford progeria syndrome is a premature aging syndrome with mutations in the protein lamin A. Symptoms include accelerated aging represented by early cessation of growth, baldness at the age of two, degeneration of the skin, muscle and bone, and often fatal heart conditions in childhood. In this disease the attachment of chromatin to the nuclear envelope becomes altered. The nuclear organization is lost along with transcription and transcription-coupled repair that is necessary to correct DNA damage. So you can see how the proper architecture within the nucleus is very important for many of the functions that go on. The reduced ability to address DNA damage triggers cell death and senescence, promoting accelerated aging. So what you might all relate to is the fact that the cellular alterations in lamin A processing seen in progeria play a role in the aging process in everyone. It just happens at a slower pace. Even in cancer, regions of heterochromatin at the nuclear periphery become repositioned in a tumor cell. These regions have special names. 
They're called LOCKs and LADs, which stand for large organized chromatin K9 modifications and lamina-associated domains. These changes indicate regulatory reprogramming is going on in the tumor cells. And rather than conferring aging, the changes confer immortalization. So within the last 10 years, we've all lived through a monumental shift from the gene-centric paradigm to the genome-scale perspective. And along with this shift came new technologies to study epigenetics, defined as changes in gene expression that cannot be explained by changes in DNA sequence. These changes are due to modifications of the histones that wrap the DNA or to modifications of the DNA itself. So I will cover these in the next two sections. At the root of DNA packaging is the nucleosome. This creates the fundamental unit of chromatin: approximately 147 base pairs of DNA are wrapped around an octamer of histone proteins, including two copies each of H2A, H2B, H3 and H4. In nuclei of mammals, the most obvious form of structural genome organization is the compartmentalization into euchromatin and heterochromatin. Multiple modifications occur on the histone tails as they wag outside of the nucleosome. There are over 150 different modifications that are described both by their position and their type. These include activating marks like acetylation and phosphorylation, and methylation, which can be activating or repressing. In its simplest form, closed chromatin provides an accessibility barrier and is repressive to gene expression. Open chromatin is a hallmark of an enhancer region and assays such as DNase hypersensitivity can find these locations experimentally. Histone modifications correspond to the accessibility of the chromatin. For example, active enhancers are marked by active histone modifications such as H3K27 acetylation and H3K4 monomethylation. In contrast, closed or inactive enhancers have specific identifying marks as well. 
These include repressive marks such as H3K27 trimethylation, but typically the active mark of H3K4 monomethylation is present. So what does it all mean? It means that we can read the functional state of the genome by histone modifications, and just as the Rosetta Stone converted an ancient language to a modern one, the computer program ChromHMM converts the histone code to types of function in the genome. Yellow enhancers, red promoters and green gene bodies differentially color the genome in every cell type. To illustrate the role of histone modifications, Kabuki syndrome is a congenital disease with an intellectual disability. Mutations occur in two proteins, KMT2D and KDM6A. These two proteins collaborate in a larger complex to turn repressed chromatin into active chromatin through exchanging the histone marks. Without the full function of these proteins occurring in all cells, you can see the emergence of this disease. In other words, these are not just passive marks; they are required for opening up the chromatin. Now, another aspect of the epigenetic code is conferred by DNA methylation patterns, which are specific for each tissue type. For example, DNA methylation plays a role in aging. And as Groucho Marx says, "Age is not a particularly interesting subject. Anyone can get old. All you have to do is live long enough." But it is an interesting subject because it's a biological phenomenon. And in fact, the longer you live, the more likely the total DNA methylation levels will drop across your genome. This could manifest as more transposable element movement and less rigorous control of gene expression, and select positions might acquire methylation somewhat sporadically. Some of these events occur at tumor suppressor genes, which might create a predisposition to cancer. DNA methylation stably alters gene expression, frequently by gene silencing. 
It suppresses the expression of viral genes and it prevents genomic rearrangements between repetitive elements. Therefore, it is crucial for maintaining genomic stability. Now, Waddington coined the term epigenetics in 1942. He described cellular programming as dependent on epigenetics, and he drew a parallel between the path a cell takes towards terminal differentiation and that of a ball traveling downwards along branching valleys. Once the ball entered a canal, it could not easily cross the mountains and enter neighboring valleys. In other words, differentiation into a tissue type is set by a series of events, and the cell cannot then be changed into another type. We know now that differentiated states can be reversed in induced pluripotent cells, through the introduction of multiple transcription factors including OCT4, NANOG and SOX2. And one can imagine that, through their interactions with other proteins in the cell, a series of changes occurs that culminates in repositioning of the chromatin within the cell to induce a pluripotent state. It is in this state that the repressive DNA methylation marks that gave the cell its differentiated identity are removed. So DNA methylation also takes center stage in behavioral studies. In rats, the normal nurturing behavior of mothers is frequent grooming and licking of their offspring. This behavior is transmissible across generations, but it's not transmitted through the germline. It is a context-dependent effect. As a result of the grooming and the licking, the glucocorticoid receptor gene in the brains of the pups becomes demethylated. This allows high levels of its expression in their brains. The result is that there are low levels of the circulating ligand, the glucocorticoid stress hormone. These offspring grow up to be adults who are calm and easygoing, and they also exhibit the nurturing behavior towards their own pups. 
Offspring raised by mothers who exhibit low nurturing grow up to be mothers who are low nurturing. Their glucocorticoid receptor genes carry DNA methylation and show decreased expression. As adults, these pups are highly anxious and stressed and perpetuate the low nurturing behavior. So this phenotype is reversible, and there are two ways it can be reversed. First, pups of low nurturing mothers can be adopted by high nurturing mothers, and they become high nurturing mothers themselves. Second, pups of low nurturing mothers can have compounds injected into their brains that erase the methylation patterns. When that happens, they grow up to be high nurturing mothers. This tells you that the effects of the environment are mediated through epigenetics, including DNA methylation. Now, methylation is very important in cancer, and we can see a reversal of the normal methylation processes, where the epigenetic landscape changes dramatically compared to a normal cell. For example, repetitive sequences in intergenic regions that are normally repressed lose their methylation and become transcriptionally active in cancer. CpG islands, which are normally unmethylated, acquire methylation, leading to transcriptional repression. The shores and shelves, the regions surrounding CpG islands, also gain methylation and block transcription of important genes. And it is through these epigenetic changes that the interaction sites of DNA change within tumor cells. This recently published report on brain tumors demonstrates this phenomenon: two genes that are normally confined by good fences to separate loop domains, and rarely interact, become closely associated in brain tumors. In this large-scale study, the authors identified roughly 10,000 of these shoelace configurations, showing how the genome harnesses form to regulate function. 
The process includes gene mutations leading to altered positions of DNA methylation, which cause interference at CTCF binding sites, the loss of neighborhoods and inappropriate gene expression. So in this final section, I'd like to recap some important points. The ENCODE consortium reported that a surprisingly large amount of the human genome, 80.4 percent, is covered by at least one ENCODE-identified element when examined in over 100 different cell types. These elements include transcribed regions and biochemical marks like histone modifications. While this doesn't prove function for 80 percent of the genome, the biochemical evidence complements the genetic evidence, the sequence conservation and the protein coding information. Note that none of these indicators of function are completely overlapping. We can now tally the number of functional elements in the genome. The 20,000 protein coding genes are being rivaled by 13,000 long non-coding RNAs, and more are appearing every day. 70,000 promoters accompany nearly 400,000 enhancers and over 600,000 transcription factor binding sites. And it's still just the end of the beginning of human genome research, because the functional roles of most of these elements are still not known. So regarding annotating the functional content of the genome, let me show you another astute quote by Carl Sagan, that the absence of evidence is not the evidence of absence. This idea is illustrated by an experiment conducted many years ago, when the deletion of an ultra-conserved region sent shock waves through the community because its loss showed no functional effects. The laboratory mice were healthy and they produced healthy offspring. Many experiments later, we know of numerous ultra-conserved elements that do have a phenotype when they are deleted. But this result provided a unique, unexpected insight: experimental evidence can be absent even for a region that has the strongest prediction of function. 
Now one explanation is that the genome carries elements known as shadow enhancers. They provide redundant function for other enhancers, and when one is lost the other can compensate. The number and extent of this redundancy in the genome is a known unknown. Other regions in the genome seem to contradict the rules for function by not showing sequence conservation across species. LincRNAs, or long non-coding RNAs, provide an example of functional elements that lack high levels of sequence similarity. However, they might not require very much nucleotide sequence conservation to maintain their functionality. So whereas protein coding genes require a correct linear amino acid code to produce a functional open reading frame, RNA molecules utilize secondary structure as the mode of function. This doesn't require strict linear sequence conservation. If an important nucleotide changes in one location, a compensatory change somewhere else can restore the molecule to full function. So as we end, I would like to finish up with just some remarks about seemingly non-functional regions. There are regions in the genome that are parasitic. The majority of transposable elements don't add any function to the genome. In fact, they can cause disease when they're not dealt with properly. The genome has specific approaches to fold these regions into segments of repeat pairing structures known as repeat assemblies. Thus, as I've told you, TEs are methylated and repressed, and they are non-functional in the majority of cases. But their relevance to disease implies that they should be sequenced and studied across the genome. Finally, let's return to this idea of the vast amounts of non-coding RNA in the genome. We are left with the question: is it random? Does the cell use it for functional purposes? Or is the act of transcription itself enough? 
A recent omics study used a new technique called ChIP-exo, which cleaves the DNA adjacent to the RNA Pol II molecule, giving base-pair resolution of its location. This is shown in the red ChIP-exo peak compared to the ChIP-seq peak or the older style ChIP-chip data, each of which has lower resolution. This new approach revealed 160,000 positioned RNA Pol II molecules in the genome, where 150,000 of them were not located at protein coding genes. They were located in the non-coding regions of the genome. So right now the debate rages on whether this is spurious binding or regulated binding, and whether it's producing functional or non-functional product. But with this new data, the tapestry of genome annotation grows ever more refined. So I will close with this, because it's a final quote from Carl Sagan. He says, at the heart of science is an essential balance between two seemingly contradictory attitudes: an openness to new ideas, no matter how bizarre or counterintuitive they may be, and the most ruthless skeptical scrutiny of all ideas, old and new. This is how deep truths are winnowed from deep nonsense. I think that's appropriate to studying the genome, and I'd like to add, this is how function is winnowed from the dark matter of the genome. Thank you very much for your attention.