All right. Good morning, everyone, and welcome to this fifth week of our Current Topics in Genome Analysis course. We're going to try to pull together some of the concepts that you've heard about over the last several weeks. So if we hark back to week one, you'll recall Dr. Green devoted part of his lecture to discussing the approaches that one could use to look for sequence similarity across very large genomic regions and across evolutionary time. And the whole point of that was to compare different organisms' genomes to one another to identify evolutionarily conserved regions. If you think back to the last three weeks, last week's lecture by Tira and the previous two weeks of lectures by myself, we spoke quite a bit about how sequence similarity is determined, how you can visualize those regions, and the kinds of insights you can gain by looking for conserved sequences both at the level of gene and protein families, but also on a genome-wide scale. So we're going to continue to build upon those themes. And this week's lecture is going to focus on the regulatory and epigenetic landscapes of mammalian genomes and the approaches that one can use to elucidate them both computationally and in the laboratory. So today, it's my pleasure to introduce to you my colleague and today's speaker, Dr. Laura Elnitsky. She is a senior investigator at NHGRI, and she did much of her post-doctoral work at Penn State in the laboratories of both Ross Hardison and Webb Miller, focusing on using genomic alignment techniques to detect and analyze regulatory regions. Her studies here at Genome focus on bioinformatic and experimental approaches that are used to identify non-coding functional elements in vertebrate genomes, using cross-species comparisons to zero in on sequences that have remained relatively unchanged throughout evolution, the ones that obviously would be most likely to be functionally important.
The ultimate goal of Laura's studies is to really better understand the crucial role that these non-coding functional elements play in establishing normal cell function. She's also been extensively involved in NHGRI's ENCODE project, which as you'll remember from Dr. Green's lecture is the consortium-based effort that's aimed at producing a comprehensive catalog of functional elements in the human genome. So with that, please join me in welcoming today's speaker, Dr. Laura Elnitsky. Good morning. Can you hear me? Okay. Okay. Now you can hear me. Okay. Well, thank you for making it out on this somewhat icy morning. First to comply with government rules and regulations. I want to tell you that I have no relevant financial relationships with commercial interests. And I'll ask you to just please hold your questions until the end so that we can get through the material. Now we are all here today because knowledge about the genome is advancing fast. It's moving from the nucleotide level blueprint to the functional implementation of genomic processes. And today what I'm going to put into context for you is some of what we know and why there's so much excitement moving forward. And keep in mind with any good mystery, lots of questions remain. So today we are going to discuss genome composition, enhancer studies, and epigenetics. Now let me put it into perspective for you. About a week ago, The New York Times ran a page one story about a gene inhibiting protein called REST. And until recently, scientists thought that REST acted mostly in the brains of developing fetuses. But researchers at Harvard found that REST is sharply depleted in key brain regions of people with Alzheimer's and other dementias. This was an eye-opening discovery, and it was offering hope for intervention strategies for this widespread disease that has been virtually impossible to treat. 
This finding is a direct result of the continuing study of the human genome, and it's only one example of what makes our field so exciting. So in order to understand these breakthroughs, let's take a look at what got us to this point. Since the late 1950s, the central dogma of molecular biology has guided studies of the human genome. It basically says that DNA in the nucleus gets transcribed into RNA, which gets processed into proteins. This dogma implicates the protein component as the vehicle that carries out the cell's important molecular functions. And under this framework, much progress has been made towards understanding Mendelian diseases like cystic fibrosis. And that's because most Mendelian diseases harbor protein-coding mutations. But while all this progress was being made, something interesting was happening to shake our relative comfort in the central dogma. So back in 2000, President Clinton, standing next to this distinguished scientist known as Francis Collins, announced the completion of the draft human genome. And he called it the most important and the most wondrous map ever produced by humankind. And he was right. And that announcement marked the beginning of a new era. We're now 14 years later, and we're on the horizon of having personalized medicine. And that is the ability to diagnose and treat diseases based on the genetic makeup of an individual. Now, of course, reaching that goal is anything but simple. This is because the genome is holding onto its secrets very tightly. So the image on the left represents our sequenced genome. It's completely linear. It follows one base after another. And it's a simple track going from point A to point B. And this track is very straightforward. But if you think about it, it carries very important cargo or information. Likewise, the track on which the genome travels represents much more complicated information than just the simple linear sequence.
So if you've ever ridden one of these amusement park rides called the Scorpion or the Freefall, you're braver than I am. But you know that even though you're strapped in with a seatbelt and a shoulder harness, you're still screaming for dear life. And just like the dynamics of that ride, the genome itself comes alive in 3D. So let's look at how the current data help us understand some of the things that the central dogma didn't explain. The basic facts about the human genome are its size, roughly 3 billion base pairs, and its number of protein-coding genes, currently standing at about 23,000. Those 23,000 protein-coding genes account for only 1 to 2 percent of the whole genome sequence, shown by that tiny sliver. The rest is divided into non-coding sequence of two main flavors. These are the repetitive elements, which contain transposable elements, remnants of transposable elements, and simple repeats. The other portion is non-coding, non-repetitive sequences. Now as far as we know, much of this 98 percent of the genome has no obvious function. Yet we recognize that regulatory elements are embedded there next to non-functional sequences. And these two are seamlessly intertwined, at least from our base-level viewpoint. So it's interesting that in the early years of genome study, guided by the central dogma, the focus was on protein-coding regions. And it's hard to imagine this now, but at first scientists were estimating that there were as many as 100,000 protein-coding genes. Now that we know that there are only 23,000, the question is, what does the remaining 98 percent of the genome do? So despite having a sequence of the genome, there is still a raging debate. How much of it is functional? How much of it is superfluous? And how well can we distinguish between the two? Now at one point in the 1970s, when protein-coding genes were the primary focus of study, the non-coding regions were called junk.
In the 80s, some people proposed that that title be limited only to regions containing transposable elements. Now in a sense the term junk leaves room for interpretation, since while garbage is stuff that you throw out, junk is stuff that you keep around because you're not sure whether it has value. So think about the junk drawer most of us have in our kitchens. I know I do. We're afraid to throw that stuff out, either because we're attached to it, or because we think that someday it might be useful in some context. Now as evidence grew for the undeniable presence of regulatory regions, a better term than junk emerged. This was the term dark matter, and it was named after the dark matter in the cosmos. Today we've come to appreciate that a small number of repetitive elements have even joined the ranks of regulatory elements. This is a process known as exaptation, and the point I want to make is that regardless of whether transposable elements are functional or not, they contribute to the etiology of disease. This necessitates that we know more about them. They cannot be ignored. So one of the fascinating things we've discovered in the process of sequencing the human genome is that among the planet's living creatures, our position on top of the heap in terms of complexity fell into question. We now know that the haploid human genome of 3.3 billion base pairs falls somewhere between a lungfish and a turkey. This is known as the C-value paradox, the fact that organismal complexity does not increase with the number of base pairs. And to make this really humbling, some single-cell protists and even salamanders have genomes much larger than ours. Now the number of genes in an organism doesn't reflect its complexity either, where humans finish somewhere between a chicken and a grape. This striking observation suggests that the origin of complexity lies elsewhere. So how might it be generated?
Of course there's cell type diversity. We also know that through RNA splicing, a single gene can produce more than one protein. Humans, for instance, produce around 100,000 different proteins from the 23,000 defined genes just by differential splicing. Added complexity is contributed by the non-coding portion of the genome, although this is the source of much heated debate. In fact, ENCODE has shown that the majority of the genome is transcribed at some point in some cell type. Furthermore, researchers at Penn State have also shown that RNA polymerase II initiates transcription at over 100,000 locations in a cell. Now roughly 10,000 of these correspond to protein-coding genes in any particular cell type. This implies that the majority of these events are regulated initiation events producing an extensive amount of non-coding RNA. Now granted, we can't predict the full range of functions that non-coding genes may serve, and I'll discuss non-coding RNA later in this talk and I'll give you a beautiful example of its function. Now in addition to non-coding RNA, the non-coding genome also contains all the regulatory features that we are familiar with, like promoters, enhancers, and UTRs. But keep in mind there are alternative 5' UTRs and alternative promoters that don't get talked about much, there are alternative 3' UTRs that give diversity to a cell, there are myriad enhancers that haven't been characterized, and there are intronic regulatory elements that need to be discovered. And think about it: mammalian replication origins still remain out of our ability to predict, even though we know they are there. Later in this talk I'll go more into epigenetic modifications. That's certainly one way that diversity is being provided to the human genome. Now underscoring all of this is the fact that most elements in the human genome have not been subjected to functional analysis. That in itself is fraught with ascertainment difficulties.
We're now using genomic sequence to successfully uncover the story of human evolution, because it refines the historical accounts of the early migration of humans out of Africa and into the Middle East and beyond. And you've probably read that having the sequence of the human genome available was crucial to the discovery of unmistakable traces of Neanderthal DNA in some modern non-African human populations. So yes, I am a Neanderthal. The genetic record gleaned from the fossil evidence shows interbreeding between Neanderthals and early modern humans, which led to both beneficial and disadvantageous genetic material being introduced into modern humans. For example, genes useful for coping with climates that were colder than Africa's were acquired, but probably at the cost of significant fertility problems. There are also regions where Neanderthal-derived DNA is noticeably absent. One large chunk of the modern human genome that bears no Neanderthal contributions encompasses the gene FOXP2, which is involved in human speech. That's not to say that the Neanderthals didn't have speech, but probably that it derived in a different way. Now there's also been some fun involved in studying genomes as well. For example, the scientific community generally regards the Yeti or the Bigfoot as only a legend, but this is one of the most famous legends in cryptozoology. That's the study of animals for which we don't have evidence that they really exist. Analysis of samples claimed to be from a Yeti found a sequence of mitochondrial DNA that matched an ancient polar bear from bones that were found in Norway dating back 40,000 to 120,000 years. So it looks like at least one mystery has been solved. The studies of the human genome are driven by suspense and competition. Before the year 2000, the race was to sequence the human genome. Today the race is to identify and understand all of the functional elements in the genome.
New competitions include the $1,000 genome, identifying disease-causing variants, and even the Undiagnosed Diseases Program of the NHGRI, and each is upping the ante for the use of technology towards the interpretation of the non-coding human genome. In the meanwhile, we continue to read about people who have made health decisions based on the analysis of their DNA, such as Angelina Jolie's decision to have a preventive double mastectomy when she found she carried a gene that can cause breast cancer. So this information is empowering, but in fact, only 5 to 10% of breast cancer cases can be attributed to the major heritable mutations like BRCA1 and BRCA2. In fact, 85% of undiagnosed cases occur in women who have no family history of breast cancer, many of them as a result of aging. Now because these elusive causal mutations will be distributed between coding and non-coding regions, it is precisely these scenarios that fuel the quest for finding functional elements in the genome. So in the early days of genome sequencing, it was a needle-in-a-haystack problem. The quest to find signatures of functional elements in the human genome was performed by comparing whole genome sequences of the human to whole genome sequences of the mouse. Researchers predicted that some proportion of the human genome was more conserved than expected when compared to the level of conservation found in neutral sequences. So neutral sequences such as ancestral repetitive elements were modeled for the rate of change over evolutionary time. That's shown by the red curve. They were compared to all genomic alignments, shown in the blue curve. These data curves were then deconvoluted using statistics to identify the small region of the gray curve, indicating that 5% of the human and mouse genomes showed evidence of functional conservation. This represented 1.5% of the genome that contained coding regions and 3.5% that contained candidate regulatory regions.
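The curve comparison just described can be sketched in a few lines of Python. This is a minimal illustration, not the procedure the human-mouse analysis actually used: it assumes conservation scores have already been binned into two histograms (one for neutral sequence such as ancestral repeats, one for all alignments), and it finds the largest share of the genome-wide histogram that the neutral shape can account for. Whatever is left over is a lower bound on the non-neutral fraction. The function name and the bin counts are hypothetical.

```python
def conserved_fraction(neutral_hist, genome_hist):
    """Lower-bound estimate of the non-neutral share of aligned sequence.

    neutral_hist: counts per conservation-score bin for neutral sequence
    genome_hist:  counts per bin for all genomic alignments (same bins)
    """
    if len(neutral_hist) != len(genome_hist):
        raise ValueError("histograms must use the same bins")
    n_total = sum(neutral_hist)
    g_total = sum(genome_hist)
    if n_total <= 0 or g_total <= 0:
        raise ValueError("histograms must contain counts")

    # Convert counts to proportions so the two curves are comparable.
    neutral = [c / n_total for c in neutral_hist]
    genome = [c / g_total for c in genome_hist]

    # Largest weight w such that w * neutral[b] <= genome[b] in every bin:
    # the neutral curve is scaled up until it touches the genome-wide curve.
    weight = min(g / n for n, g in zip(neutral, genome) if n > 0)
    weight = min(weight, 1.0)

    # The part of the genome-wide curve the neutral curve cannot explain.
    return 1.0 - weight


# Made-up counts: the last two bins hold an excess of highly conserved windows.
neutral_bins = [100, 400, 400, 100, 0]
genome_bins = [95, 380, 380, 120, 25]
print(round(conserved_fraction(neutral_bins, genome_bins), 3))
```

With real data, sparsely populated bins make the minimum ratio noisy, so a practical version would smooth the histograms or restrict the fit to well-populated bins.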
Now I say that these are candidate regulatory regions because alignment is not proof of function, and the statistics certainly don't tell us what the functions are. Since then the amount of the genome under functional constraint has been recalculated many times. Today estimates range between 5 and 15% under the most critical analyses, but that's still a very long way from 98%. Today, in order to narrow the search for meaningful pieces of non-coding genome, we align genomic sequences from species that are very different from us. This is to give a large evolutionary window. And the reason for that is that coding regions and regulatory elements are protected from random drift across evolutionary time by purifying selection. This is the pressure to remove deleterious changes from a sequence, and it provides a very strong signal of regions like transcription factor binding sites whose sequence has been maintained because it is critical for a biological function. So here are some transcription factor binding sites from hypersensitive site 2 of the locus control region of the beta-globin gene, which actually come from a long time ago, from my graduate studies. Now sometimes we want to study specific functional elements in the human genome by comparing them with other primates. This happens in the case where these elements don't appear in more evolutionarily distant species such as mice. We might be studying neurobiology or anatomy or immunology in these cases. But because human and chimp genomes are too close for a pairwise comparison to be informative, we use a technique called phylogenetic shadowing. This is similar to the concept of phylogenetic footprinting. But shadowing aligns multiple primate sequences to find strongly conserved sequences and eliminate less well conserved regions. In this way, we can focus on only those regions that don't change over time. Now some very extreme examples of negative selection exist in the genome. By far the coolest may be the ultra conserved elements.
These regions are longer than 200 base pairs and they have 100% identity with no insertions or deletions between the human, rat, and mouse genomes. It's pretty amazing. Now when ultra conserved elements were first identified, the thought was that they represented traces of contamination of human DNA in the mouse and rat samples. And it was only later shown that they actually belong in these genomes. Although this extreme level of conservation predicts some functional importance, many functions of ultra conserved elements have not been characterized. Now as you might predict, many of them represent developmental enhancers. Still others of them act in a dual role where they are both exons for the gene in which they are embedded and they are enhancers at the same time for neighboring genes. And there is compelling intrigue surrounding some ultras which act in completely unpredictable ways. For example, some are used as alternative exons and they're called poison cassette exons. In other words, they have an in-frame stop codon embedded within them and they are used to invoke nonsense mediated decay of a messenger RNA transcript. It's completely effective at removing those mRNAs from production, but at the same time it seems counterintuitive. Now not all functional elements are conserved. On the opposite side of conservation, some genes are under selection to increase the number of nucleotide changes in a particular species. This is known as positive natural selection, the force that drives the increase in advantageous substitutions. A well-known example, although you might not think this is advantageous, is the sickle cell mutation in the beta-globin gene. In the heterozygous form, it actually protects against malaria. And this connection explains why the mutation has been retained and why the incidence of sickle cell corresponds to regions endemic with malaria. Another example is a mutation that allows lactose tolerance to persist into adulthood. 
This mutation is in a regulatory region near the gene for lactase. This particular variant was apparently selected in parts of Europe as the domestication of cattle was taking place. And it's fascinating to see this parallel between the geographical and the genetic data. So in 2006, an entire collection of regions with accelerated divergence was identified. These are known as human accelerated regions, or HARs for short. They were conserved across species, as you can see in this image. But they had accelerated changes in the human sequence, even relative to chimp. Now one of these sites contains a gene enhancer function that targets expression to the opposable thumb. And it may also be responsible for modifications in the ankle or foot corresponding to walking upright. So as you can see from the image, using the chimp sequence in a transgenic mouse, there is no expression in the limb buds. But when the human substitutions are added to that sequence, suddenly you see expression in the limb buds of the transgenic mouse. And likewise, starting with the human sequence, if you revert those human-specific substitutions back to the chimp sequence, expression is lost. So I just want to make the point that while accelerated regions are recognized as an evolutionary adaptation, there are other processes in the genome that can produce an increase in the number of base changes. And the point is that our ability to decipher mechanisms in the genome by computational means is constantly advancing. And with those refinements come better interpretations of the biology. So like I tell my students, it's never enough to just look at something once. Much of what we know about the genome is only the tip of the iceberg, and most of the secrets are still hiding beneath the surface. Only when we understand more about the regulatory and non-coding DNA, and the non-coding RNA that's captivating our attention, will we understand the role of the dark matter in the genome.
And until then, we don't know how much we don't know. We do know that the impact of sequence variants in diseases ranges from single-gene disorders to complex multigenic diseases. And strikingly, these categories have very different underlying genetic architectures. Mutations causing single-gene Mendelian diseases are rare, and they largely occur in protein-coding regions. In contrast, genetic variants that are present in complex diseases such as Alzheimer's and type 2 diabetes have high minor allele frequencies, indicating that their presence is common in the population. These diseases are giving us insight into enhancer activities, but this actually came as a surprising result. The approach to studying complex diseases was extremely challenging because our genomes are so large, and whole genome sequencing has been so prohibitively expensive. So scientists wanting to study these diseases needed a shortcut. The work of the HapMap Consortium provided that shortcut as a collection of genetic variants represented on an array chip. First, the single nucleotide polymorphisms, or SNPs, were identified in DNA samples from multiple individuals. Next, the adjacent SNPs that are inherited together were compiled into haplotypes. And finally, these haplotypes were further reduced to a small number of variants called tag SNPs that served as a proxy for the full haplotype of an individual. So here we're seeing positions of three tag SNPs that identify four haplotypes carried in individuals. Now the genotyping arrays carrying the tag SNPs were then used for genome-wide association studies, or GWAS. This is where DNA from one large group of affected individuals is compared to a similar-size group of unaffected individuals. The SNPs that are statistically associated with the disease will be overrepresented in the affected samples. So to date, more than 1,400 elusive disease-associated variants have been identified in the genome using GWAS, and by now that number is probably out of date.
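The case-versus-control comparison at the heart of a GWAS can be sketched for a single SNP as follows. This is a minimal illustration assuming a simple allelic chi-square test on a 2x2 table of allele counts; real studies use dedicated software, correct for population structure, and apply stringent multiple-testing thresholds. The function name and the counts are hypothetical.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Rows are affected (case) and unaffected (control) groups; columns are
    counts of the alternate and reference allele at one SNP.
    Returns (chi-square statistic, p-value).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one allele")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were unrelated to disease status.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts: the alternate allele is overrepresented among cases.
chi2, p = allelic_chi_square(case_alt=60, case_ref=40, control_alt=40, control_ref=60)
print(f"chi2={chi2:.2f}, p={p:.4f}")
```

Running this test across every tag SNP on the array and ranking by p-value is what produces the list of associated variants; the tag SNP itself only marks the haplotype that carries the signal.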
These variants are coming fast and furious, and the vast majority of them lie in non-coding regions. This was actually a big surprise, and it provided enormous opportunities to further study the contributions of the non-coding sequences. So the premise is that these variants are hitting enhancers that are located either in introns or intergenic regions, and that these variants disrupt transcription factor binding sites within these active enhancers. This premise is borne out by studies of GWAS variants. In these cases, variants fall into enhancers that are active in the affected cell type. In other words, here we're looking at DNase hypersensitivity in each of several different cell types. DNase hypersensitivity is a marker for an active regulatory element. So what we see in the first panel is that variants identify active immune cell enhancers in Crohn's disease, and likewise in multiple sclerosis. In the next panel, variants identify active cardiac cell enhancers in cardiovascular disease. In the next panel, variants identify active prostate cell enhancers for prostate cancer, and in the last panel, variants identify active brain cell enhancers for neurological diseases. Now what we are looking for is a strong phenotype that occurs when a GWAS SNP prevents binding of a transcription factor and disrupts the entire function of the enhancer. Actually, single base changes in enhancers often have very small effects due to the modular construction of enhancers. And this makes sense. It explains why we're seeing modest levels of disease risk in GWAS studies. It also explains why these diseases are perpetuating in the population. These are very small changes, but in fact they are leading to enormous health burdens. Now one note of caution about GWAS SNPs. If you find one, you need to be aware that they are often not the SNPs that cause the disease.
This is because SNPs are tightly linked in a region of a haplotype, and the statistics might misleadingly point to one SNP, whereas the function might be provided at a nearby position. So to assign causality, it is necessary to validate function. This is often done in a multi-step fashion. For example, conservation can be assessed in the UCSC Genome Browser. And keep in mind that regulatory elements can be located in intragenic regions, as I'm showing here, or far from the genes they regulate in intergenic regions. Now there are a number of other data types that can facilitate the interpretation of an active regulatory region, including transcription factor occupancy, DNase hypersensitivity, FAIRE data, which is another proxy for open chromatin, and chromatin modifications, which we'll talk about later in this talk. Data for each of these characteristics are available in the UCSC Genome Browser. Now taking validation one step further, as we've seen, transgenic mice can be created that test the expression patterns conferred by enhancer sequences. Here I'm showing a screenshot of the VISTA Enhancer Browser at Lawrence Berkeley National Lab. Here they display enhancer regions that have been tested in transgenic mice. You can go to this site and you can select a tissue-specific type of enhancer activity that you're interested in (here I've selected heart), and you can see the expression patterns in developing mouse embryos, as shown in the panel on the right. You can select sequences from human or mouse genomes. You can see both positive and negative outcomes for any sequences that they have tested. You can even download the sequences representing these enhancers for further studies of your own. Now one problem with enhancer assays is that they take a long time and they're not easily performed on a large scale. However, new approaches such as STARR-seq are emerging for the assessment of enhancer elements in an unbiased way.
In this assay, fragmented genomic DNA is cloned into reporter vectors. It is then transfected into cultured cells, and the RNA is sequenced. The amount of RNA expressed in any of these cells relates directly to the level of the enhancer strength, and because the enhancer fragment is incorporated into the transcribed RNA sequence, it can be mapped directly to the genome. So in this way enhancers can be discovered, they can be mapped, and they can be measured all within the same assay. Now it's a big task, but once the strategic mapping of all enhancers in the genome is completed, as I would like to see, the interpretation of disease-associated variants could happen much more quickly. Comprehensive studies of enhancers are revealing information that we never knew. For example, two new studies recently reported very long stretches of active enhancers near genes that are essential for a given cell type. These are termed super-enhancers and stretch enhancers, and the idea is that a long runway of enhancer sequence appears upstream of genes. This is like a regulatory red carpet. These extended regions are thought to be so long to ensure that the genes get expressed in critical cell types. Moreover, the authors of these studies are showing that GWAS variants are enriched in these super-enhancer regions. This supports the idea that complex diseases are arising in tissue-specific regulatory elements. Now I want you to be aware of the fact that 50% of predicted enhancer regions do not show a function in enhancer assays. Sometimes this can really dismay somebody doing these types of experiments. But this may be explained by many reasons. For example, the cell type might be wrong for the expression type that is necessary. The DNA being tested might be missing some important component of the enhancer element. The promoter used in the reporter vector might not want to talk to that particular enhancer.
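The readout of the reporter assay described at the start of this passage, where the amount of RNA from each fragment reflects enhancer strength, can be sketched as a per-fragment ratio. This is a minimal illustration under assumptions that are not from the talk: read counts per fragment are already available from the RNA and from the input plasmid library, and strength is scored as the RNA share divided by the input share, with an optional pseudocount. The function name, fragment names, and counts are hypothetical.

```python
def enhancer_strength(rna_counts, input_counts, pseudocount=1.0):
    """Score each cloned fragment by RNA output relative to input abundance.

    rna_counts:   {fragment_id: reads recovered from reporter RNA}
    input_counts: {fragment_id: reads from the input plasmid library}
    Returns {fragment_id: enrichment}; values above 1 suggest the fragment
    drives more transcription than its abundance in the library predicts.
    """
    fragments = list(input_counts)
    n = len(fragments)
    rna_total = sum(rna_counts.get(f, 0) for f in fragments) + pseudocount * n
    input_total = sum(input_counts[f] for f in fragments) + pseudocount * n
    if rna_total <= 0 or input_total <= 0:
        raise ValueError("no reads to score")

    scores = {}
    for f in fragments:
        # Share of all RNA reads versus share of all input reads.
        rna_share = (rna_counts.get(f, 0) + pseudocount) / rna_total
        input_share = (input_counts[f] + pseudocount) / input_total
        scores[f] = rna_share / input_share
    return scores


# Made-up counts: fragment "frag_A" is overrepresented in the RNA.
library = {"frag_A": 100, "frag_B": 100}
rna = {"frag_A": 300, "frag_B": 100}
print(enhancer_strength(rna, library, pseudocount=0.0))
```

Because each fragment is identified by its own sequence in the transcript, the same table of counts gives both the genomic position and the activity estimate, which is why discovery, mapping, and measurement happen in one assay.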
So despite all the things we know about enhancers, sometimes it's still very difficult to tease out their activities experimentally. And the one point that I haven't mentioned is perhaps these elements that we're predicting as enhancers have a completely different function, and so of course they don't work as enhancers. So these failures highlight the fact that the complexity of recreating specific regulatory schemes in a laboratory falls far short of the endogenous complexity. It also warns us that a negative result is not always the final word. Now for example, most of us probably have a backup of our computer files to prevent accidental loss. The genome has a backup system as well, carrying spare enhancers known as shadow enhancers. These enhancers map far from a target gene and they preserve the activities of the primary enhancer. In this way the loss of a primary enhancer might not be so devastating. And shadow enhancers return us to the conversation of ultra conserved elements. And four of these elements with known enhancer functions were deleted in the mouse genome. No functional consequences occurred. Let me tell you, this was shocking to the field at the time. But it delighted the mice, I'm sure. So one explanation is that the shadow enhancers compensated for the missing regulatory activity. So while we have a lot to learn about enhancers, GWAS studies clearly demonstrate a framework for unraveling the genetic basis of complex traits in which the majority of common complex trait-associated variants fall into non-coding regions and they have modest effects. But keep in mind that much work needs to be done to further unravel the undiscovered causes of genetic diseases. And because of that, I want to highlight an example that emphasizes that most diseases will have varying proportions of coding and non-coding contributions. 
So in this case, after GWAS studies were completed, additional rare variants for inflammatory bowel disease were found in the coding regions upon deep resequencing. Now the genotyping arrays cannot assess the presence of these very rare variants because they're not part of the arrays. These are personal variants. So we know that rare variants can affect coding regions, right? But where it gets really interesting is that rare variants in coding regions of genes can also affect the regulated expression of the gene. For example, variants affect splicing of messenger RNA when they're located directly adjacent to the splice junctions. We all know that. But even more camouflaged positions are embedded directly within the coding sequences. These positions carry binding sites for splicing proteins. These are known as exonic splicing enhancers and exonic splicing silencers. This explains why even synonymous substitutions can have deleterious consequences, including devastating effects from exon skipping and nonsense-mediated decay. Now a complementary set of these binding sites falls in the introns of genes. These are known as intronic splicing enhancers and intronic splicing silencers. These elements are adding to the mysteries of the dark matter because we know they are there, but they've been very difficult to identify. Now the Holy Grail of non-coding studies is illustrated by single mutations in enhancers that cause disease. Although very few are known, here's an example. In the case of the sonic hedgehog gene, a gene involved in the formation of digits and fins, point mutations in an enhancer region create extra digits, signaling a loss of repressive sites or a gain in additional enhancer activities. We've all seen cats with six toes, providing a real-world example of this process. The enhancer for this gene lies one megabase away from the sonic hedgehog gene in an intron of another gene.
And what's really interesting is that this entire regulatory element is lost in certain species. These include snakes and a limbless newt. And this loss could possibly explain their limbless morphologies. So what's missing from the central dogma of molecular biology? It's the role of epigenetic processes, factors that control genetics other than the DNA sequence. These epigenetic processes, in effect, are an additional language of the genome. So in the second half of the talk, we're going to talk about four areas of epigenetics, including chromatin, non-coding RNA, nuclear architecture, and DNA methylation. So for example, the histone code begins with the packaging of DNA. This process has practical applications, because two meters of linear genome has to fit into a single cell. It's this packaging that represents the second language of the cell. Chromatin modifications occur on the histone tails of a nucleosome. These modifications include methylation, acetylation, and phosphorylation, among others. This packaging creates euchromatin, the open and accessible form of transcribed DNA, and heterochromatin, the closed and transcriptionally silent form. Over 150 different epigenetic marks are catalogued, and they indicate the activities of the underlying DNA. So think of this as a Rosetta Stone. If we know the language of the histone modifications, then understanding the genome falls into place. Now books are being written on this topic. At the most basic level, histone acetylation is placed in active regions by a histone acetyltransferase, or a HAT, and it is removed in inactive regions by a histone deacetylase, or an HDAC. Histone methylation, on the other hand, can participate in both active and repressed chromatin structures. And here I'm showing histone H3 lysine 4 monomethylation, or H3K4me1, which acts as an enhancer mark when enhancers are active. 
I'm also showing H3K27 trimethylation, which acts as a repressive mark by interacting with the Polycomb protein. So the chromatin code contains writers for placing these marks, erasers for removing these marks, and translators for reading the marks, and also for recruiting larger protein complexes. Now when these systems go awry, it's fascinating. It's devastating, but it's fascinating. The internal machinery gets confused. So for example, cancers of the blood result when genes cannot be turned off during normal differentiation. The MLL protein normally coordinates this repressive response in blood cells, but in leukemias, MLL acquires new functions by fusing to partners through chromosomal translocations. Many of these partners have acetylation activities. So MLL carries them throughout the genome to all the places that it would normally place a repressive mark. They in turn place activating marks. This transforms the entire landscape of the cell into a different message. Now in fact, mutations in epigenetic modifiers are well documented as causes of cancer. I'm only showing you a few here. There are many, many more known. So if you think about it, perhaps in these cases a treatment for cancer may come from our ability to modify the epigenome or to interfere with these processes. So now I want to turn to non-coding RNA. They are truly the hidden treasures of the genome. They also have been described under the domain of epigenetics because they affect the regulation of genes. LincRNA represent a large class of long non-coding RNA that are often poorly conserved. And they regulate gene expression by diverse mechanisms that are not yet fully understood. LincRNA are poorly conserved. Now that may be because they only function in one or a few species and don't have millions of years of selective pressure to maintain their sequences. Or it may be that their primary sequence is not as important as maintaining their secondary structure. 
So in other words, they represent a novel class of functional elements where sequence conservation is absent despite the presence of function. That's probably the most important thing I'm going to say from this whole talk. It's a very important point. Now recently four roles of lincRNA were proposed. They can serve as guides to recruit the chromatin-modifying enzymes to their target genes, the methyltransferases, the acetyltransferases. They can serve as signals to indicate that gene regulation is happening at some particular area. So we can use them as a display for finding functional elements in the genome. They can also act as decoys to titrate transcription factors away from a target region. LincRNA are also known to serve as scaffolds. They bring large protein complexes together. Given these important roles in the cell, it's not surprising that the misregulation of lincRNA is associated with oncogenic outcomes. And perhaps, if you think about it, maybe even GWAS variants are falling into these functional regions. So here's the example I promised you. The best-known example of long non-coding RNA is the XIST gene. It plays a role in mammalian X inactivation. In females, one of two X chromosomes must be inactivated, and that occurs through the function of the XIST gene. This transcript binds to the entire length of the chromosome from which it is transcribed. It's only transcribed from one, and it binds to that chromosome, and it inactivates that chromosome. Recently, it was very elegantly shown that XIST spreads across the chromosome by using contact points established by the 3D architecture of the chromosome. It then interacts with the Polycomb complex to repress gene expression. Now imagine if XIST were transplanted to another chromosome. What might happen? Researchers are exploring this concept with Down syndrome, where a third anomalous copy of chromosome 21 exists. 
They've been able to show, at least in cultured cells, that by inserting the XIST gene into that third chromosomal copy they can successfully silence expression from that chromosome. This speaks to the power of using genomic processes for therapeutic applications. It's really fascinating, and it strongly emphasizes how we have to understand the processes of the genome before we can use them to the most advantageous purposes. Now like lincRNA, microRNA are another class of non-coding RNA, but they have much smaller sizes. These RNA transcripts are processed to create a 22-nucleotide message that targets messenger RNA and causes degradation of the primary transcript in the nucleus, or interferes with translation in the cytoplasm. There are more than 1,000 microRNA identified, and their specific expression patterns are being investigated in disease samples. It appears that their primary role is to create balance among transcripts within a cell. And in fact, microRNA are known to target the enzymes that are the epigenetic modifiers of the cell. So the final class of non-coding RNA I'll mention today is called eRNA. These are short transcripts that are produced at the site of enhancers, and they are necessary for transcriptional activation of the genes those enhancers target. When levels of these non-coding RNA are reduced, transcription levels of nearby genes are also reduced. To me, eRNAs are fascinating because they offer a powerful new way to predict where enhancers are throughout the genome. So as we've seen, non-coding RNA find their targets through proximity and nuclear organization. And that is critical for these processes to occur. And for those reasons, you are going to hear me call genomic architecture an epigenetic process. Now this same chromosomal structure is necessary for regulatory elements to find their targets. For example, enhancer elements can be very distant from the genes they target. 
And in order to talk to those genes, these enhancers come into physical contact with a promoter by utilizing a looping strategy. And there are several experimental techniques out there, such as ChIA-PET, to try and discern these long-distance interactions. One important protein that is central to most regulatory processes is CTCF. It functions at tens of thousands of sites in the genome as an insulator and a boundary element. CTCF is heavily involved in the regulation of CFTR, which is the gene for cystic fibrosis, by forming loops and by interacting with additional proteins, such as cohesin and condensin, whose names sound like their functions: bringing sequences together. In this way, CTCF partitions the genome into distinct regions of open and closed chromatin regulatory domains. Now what remains missing and what we desperately need in descriptions of gene regulation is to integrate information about how nuclear organization contributes to the regulatory interactions that occur between chromosomes. And one of the most important discoveries in the field of genome biology has been the demonstration that genomes are non-randomly organized in the nucleus. Chromosomes are organized into territories where transcriptionally silent heterochromatin and gene-poor chromosomes are localized to the nuclear periphery, whereas gene-rich chromosomes and highly transcribed euchromatic DNA are situated in the center of the nucleus. Now these localization patterns of specific genes will change depending on their expression patterns. We need assays to quickly visualize these differences. Right now it's quite challenging to do this. Now if we look at the silenced region at the nuclear periphery, there are important proteins known as lamin proteins. They form a scaffold known as the nuclear lamina around the edges of the nucleus. Lamina means layer. This layer supports the nuclear architecture and helps to organize nuclear processes such as DNA and RNA synthesis. 
So Hutchinson-Gilford progeria syndrome is a disease caused by mutations of the protein lamin A. The outcome is a premature aging syndrome. Children with this disease stop growing early. They're bald by the age of two. They have progressive degeneration of the skin, muscle, and bone. And they often suffer fatal atherosclerosis in childhood. All this occurs because without lamin A function, nuclear organization becomes unstable. This throws the regulatory patterns of the cells into disarray. So we've talked about epigenetics as chromatin, non-coding RNA, and nuclear organization. The final component of epigenetics we will discuss involves DNA methylation, and it is so integral to gene expression that methylated cytosine is often referred to as the fifth base. So there's A, C, G, T, and methylated C. Now, DNA methylation is a process that stably represses genes as cells divide and differentiate, so as they move from embryonic stem cells into specific tissues. And this is illustrated in this classic example of a ball following a path to roll downhill. Each stage of the descent adds further restrictions, and these can be seen as differential methylation, to refine the identity of the developing cell. Methylated DNA also binds proteins known as methyl-CpG-binding domain proteins. These proteins interact with the histone deacetylases, forming heterochromatin. And it's this link between DNA methylation and chromatin structure that is very important. So an illustration of a disease caused by methyl-binding protein mutations is Rett syndrome. This is a devastating neurological disease affecting one in 15,000 children, nearly all of them female. These children seem to develop normally for 6 to 18 months, then affected girls lose interest in play, they become withdrawn and anxious, they develop autistic-like behaviors, and they acquire symptoms like repetitive teeth grinding and hand wringing. 
It turns out that the mutation affects the chromatin structure among at least two genes that are misregulated in this disease. The normal looping on the left to form silent methylated chromatin fails, and instead turns into acetylated, active DNA, which changes the expression pattern of those genes. So in this case, it's not the binding to methylated DNA that's the problem, but it's the message that gets translated out into the rest of the cell. DNA methylation is also fascinating because it mediates gene-environment interactions. So studies in mice show these effects with markers for coat coloration. Mothers who are supplemented with food containing methyl donors give birth to progeny whose children have mostly normal methylation patterns as shown by the brown fur. In contrast, mothers lacking the supplement give birth to progeny whose pups have more abnormal methylation patterns as indicated by their yellow fur. The conclusion is that a poor diet during pregnancy can affect the epigenetic processes of a woman's grandchildren. But the implication is that we all need to be careful about what we eat. Now epigenetic changes can also be caused by the environment to affect behaviors. So mouse pups whose mothers licked them within the first week of life showed reduced anxiety and lower stress responses in adulthood. These findings show that long-lasting epigenetic changes occur in the brain, and they underlie lifelong differences in behavior. But keep in mind epigenetic events are dynamic. And the notion is emerging that we can reverse negative health influences by targeting everyday choices such as eating high nutrient density food, avoiding junk food, allergens, toxicants and infections, getting plenty of exercise and sleep, minimizing stress, and nurturing each other better. This is exactly what our mothers told us to do. DNA methylation also plays a role in numerous genomic changes seen in cancer. 
In tumors, DNA methylation appears in regulatory regions like CpG islands where it is normally absent. Oftentimes, this represses expression of important tumor suppressor genes. Now conversely, normal methylation is lost throughout the genome in cancers. This facilitates genomic rearrangements that occur between transposable elements. This is another hallmark of cancer. So these observations taken together suggest to us that DNA methylation plays a direct role in maintaining genomic integrity. Now despite dramatic progress in epigenetics during the past decade, DNA demethylation remains one of the last big frontiers in genomics with very little known about it. Methylation can be removed by many processes. For example, GADD45 directs DNA repair processes to replace methylated bases. TET proteins, on the other hand, convert 5-methylcytosines to 5-hydroxymethylcytosines in a progressive process. You'll hear a lot about 5-hydroxymethylcytosine in the future. It's very new and it's opening up the field. So let me end with an update to the classic scenario posited by Waddington, updated to fit the modern genomic landscape. In it, the conformational architectures within the nucleus alter the steepness of the valleys to influence the destination of the rolling ball. The chromatin modifications, shown in red, DNA methylation, shown in blue, lamin proteins, shown in green, and the chromosome looping proteins, shown below in red, work in concert to position these valleys and determine the cellular phenotype. The vast majority of these events are implemented in the dark matter of the genome. So to conclude, I began this talk by saying that the central dogma was important in guiding decades of research in molecular biology. Over the past 14 years, the sequencing of the human genome has advanced our understanding of evolutionary diversity among species. It has highlighted the importance of non-coding sequences and accelerated detection of disease processes. 
It now provides a platform for understanding how epigenetic processes emerge from the dark matter of the genome to shape the biology of ourselves and our own integration with our environments. With this knowledge in hand, the stage is being set for personalized medicine to provide the appropriate treatment at the appropriate dose for the appropriate patient at the appropriate time for the appropriate outcome, and it would be hard to imagine a better result. So with that, I thank you, and please feel free to come down to the front to ask questions.