Okay, good morning everyone, let's go ahead and get started. We're going to hark back to the very first week of this course, where Dr. Green devoted a good part of his lecture to the approaches one can use to look at sequence similarity across a large number of genomes across evolutionary time, comparing each of those organisms' genomes to one another, and the kinds of insights you can get by doing that, looking for conserved sequences across organisms. Then in weeks two, three, and four, hopefully we gave you a good sense of how you actually do that by doing pairwise sequence comparisons and how you start to visualize them through things like the UCSC Genome Browser. So we're going to build on all of those themes in this week's lecture, which is specifically going to look at regulatory and epigenetic landscapes in mammalian genomes and the approaches that one can use to elucidate those, both from a computational standpoint and from a laboratory standpoint.
So it's my pleasure today to introduce to you Dr. Laura Elnitski. Laura is an investigator in the Genome Technology Branch at NHGRI. She did most of her postdoctoral work at Penn State with both Ross Hardison and Webb Miller, and while she was there, she was focusing on genomic alignment techniques to detect and analyze regulatory regions. Her studies here at Genome are focusing on bioinformatic and experimental approaches that are aimed at identifying non-coding functional elements in vertebrate genomes, and she uses cross-species sequence comparisons to zero in on sequences that have remained relatively unchanged throughout evolutionary time. She's also been very extensively involved in the ENCODE project, and as you all will recall, since we mentioned the ENCODE project quite often in this course, that's the comprehensive effort that is intended to develop an encyclopedia of all of the functional elements in the human genome. So with that, please join me in welcoming today's speaker, Dr. Laura Elnitski.
Good morning. We all recognize that our DNA is full of elusive genomic attributes. These include our physical traits, our susceptibility to illness, and behaviors that are uniquely our own. So today I want to talk about an attribute that is more relevant to this topic, the elusive functional element.
So in 1975, King and Wilson were studying the differences in protein-coding genes between humans and chimps, and they concluded that the modest divergence observed in protein sequences cannot account for the profound phenotypic differences between humans and chimps. And they proposed then that regulatory mutations must be primarily responsible. Now this is totally plausible today, but 35 years ago this was a revolutionary idea.
So let's look at the numbers. We know that 5% of the genome is under negative selection, based on comparisons of the genomic sequences between humans and mice. Of that total, 1.5% of the genome represents coding sequences, and the question remains how much of the genome is functional. If you do the math, that leaves 3.5% of the genome as non-coding sequence under measurable selective constraint, but we have no universal way to measure what is functional in the genome at this point, and therefore that remains the million-dollar question.
So I'm going to cover five discussion points. Those will be the nuclear architecture and how that plays into gene regulation, the spectrum of genomic mutations that are being analyzed today, regulatory sites that are affected by mutations, epigenetic modifications, and DNA methylation in cancer.
Now we tend to think of the genome as a static landscape with very well-defined features, and even the title of my talk alludes to that, as I call it a landscape. But contrast that with the vision of a regatta on the open sea, where there's very little constraint, there's lots of movement, and the only thing that keeps these boats from running into each other is the fact that they have anchors and good captains.
Regulation in the genome is somewhere between these two extremes. It has constraint, it has directionality, and yet it's very fluid and there's lots of movement, so keep that in mind as we go through this information.
So let's start with the spatial organization within the nucleus. There are four points that I want you to know. The first one is that individual chromosomes occupy distinct positions in the nucleus. These are referred to as chromosome territories. The second one, different chromosome segments adopt a complex organization and topography within their chromosome territory. Stylomeric regions then tend to be pushed towards the nuclear periphery, and gene-rich regions tend to be oriented towards the nuclear interior, whereas the gene-poor regions or nontranscribed regions tend to be oriented toward the periphery. The fourth point, this polarized intranuclear distribution of gene-rich and gene-poor chromosomal segments has been shown to be an evolutionarily conserved principle of nuclear organization. This is shown in human cells and it's also present in Drosophila and worm cells.
So here's an image of the nucleus. And I'll point out features like the nuclear lamina, which is right at the edge of the nuclear envelope. The nucleolus, which is the central hub for ribosome assembly. Here are chromosome territories. Each of these colors represents a chromosome in an interphase nucleus. You see how spread out these chromosomes are. You see that they have boundaries adjacent to one another. They have boundaries adjacent to the nuclear lamina. Here are other objects in the nucleus, such as nuclear speckles. There are many different types of these elements. Some of them are found at the boundaries between chromosomes. Some of them are interspersed in the midst of chromosomes. So in terms of gene expression, things around the nucleolar periphery tend to be non-expressed, and genes in the nuclear interior tend to be actively expressed.
And those that have been moved toward the nuclear periphery are again silenced. So what does that mean regarding nuclear dynamics? Well, the repositioning of a gene locus is often associated with activation or silencing. And the importance of when and where these interactions take place in the nucleus is currently a subject of intense investigation. Structural constraints impose limits on chromatin mobility, just like those anchors on those boats. Things need to be tethered within the nucleus so that they are not floating around randomly running into each other. And so understanding how the dynamic positioning of the genetic material in the nuclear space is integrated with the higher-order architecture of the nucleus is essential to our overall understanding of gene regulation.
So let's take a look at what's going on around the nuclear lamina. The nuclear lamina is shown here, right at the boundary with the nuclear envelope. And in a repressive domain, genes are organized in such a way that they are not expressed. However, if they escape this repressive domain, they are then available to be activated, although there are other factors involved, as some do become activated and some do not. Notice here a neighboring domain that has a different microenvironment, so that this gene can be activated while it is still near the nuclear periphery. And this then would be an activating domain. Notice also the nuclear pore complexes. Many genes are associated with those nuclear pore complexes, and it is the proteins that are part of the nuclear pore complexes that are necessary to activate transcription of those genes.
So here's a point just to think about. The possibility that spatial networks of genomic loci exist in the nucleus implies the presence of a previously unexplored level of gene regulation that coordinates expression across the genome.
So we need to move our thinking from that of a sequence-based analysis to that of a three-dimensional analysis. So let's take a look at some of these interchromosomal interactions. I'd like to demonstrate some of these for you. Spatial interactions are described as chromosome-kissing events or in some cases gene-kissing events. Now this happens in gene silencing, where two chromosomes come together, and this could be around a polycomb body. There's only one shown here, but it could be a rather large complex of polycomb molecules to keep gene expression silenced.
This also happens in X inactivation. So this is on a larger scale, where the majority of the chromosome is being silenced. Before inactivation, two copies of the X chromosome are present. At the onset of X inactivation, these two copies come together at the X inactivation centers, and the inactive copy and the active copy are designated. And in mid to late S phase, they again move apart and the inactive X chromosome moves to the nucleolar periphery. So again you see things coming together and then you see movement to distinct locations in the nucleus. It's not just the movement or where things are located. There is also a role for non-coding RNA, and this is best known in X chromosome inactivation. So the non-coding RNA Xist coats the DNA of the inactive X chromosome to a point where genes are fully silenced.
Interchromosomal interactions such as chromosome kissing also happen in gene activation. This is known in T lymphocyte gene activation, where genes are reaching out to each other across different chromosomes and meeting at a transcription factory. And this doesn't only happen for transcription factory function. The transcription factory here could be replaced by a collection of CTCF sites or by splicing machinery known as splicing speckles, so lots of different functions are being coordinated in the nucleus.
And so the convergence of loci within transcription factories promotes both cis and trans interactions. Some of these are quick and some of these are long term, and many of them are tethered by these functional components that comprise enzymatic activities. This opens the possibility that enhancers of a gene could lie on an entirely different chromosome. We're all familiar with this idea that enhancers could be located very far away on the same chromosome, but now we need to start thinking about the fact that they could be on an entirely different chromosome.
So let me show you quickly what an interchromosomal interaction looks like in a transcription factory. Now this is just a model and of course it's not perfect, but it's different from the way we typically think about gene transcription. So this is from Peter Cook's group, and basically here's a transcription factory. We have a transcription factor tethering the DNA. We have a transcription unit here and here and a promoter here. This promoter is already engaged with the RNA polymerase, and let's see if this will work for us. And so here we go. You see that the genes are actually being brought to the RNA polymerase molecule, which is immobile, and the DNA is ratcheted through the site of transcription. The messenger RNA is collected and then transported to where it needs to be. So that's just something to keep in mind.
So these long-distance interactions or these interchromosomal interactions are difficult to detect, but there are new technologies emerging to identify when and where they're happening. The technique of chromatin conformation capture is probably familiar to a lot of you. This is one of the main experimental techniques that is used for this purpose today. And so there are a number of steps, where cells are first formaldehyde cross-linked. This brings the proteins into covalent attachment with the DNA. The cells are then lysed and the non-linked proteins and DNA are removed.
A restriction digestion then reduces the size of those relevant fragments to something manageable. Ligation occurs at low DNA concentration. This reduces the possibility that random fragments will be ligated. The DNA is purified and PCR amplified. Now in parallel, derivative methods adopting the same idea are being developed. These are known as 4C or 5C, and they are coupled to the technology of DNA microarrays or quantitative DNA sequencing. These techniques follow the same basic principles, where DNA and proteins are first cross-linked. Ligation occurs. There is circularization of those products and universal primer annealing, followed by PCR amplification and analysis by microarray or sequencing. And this allows the screening of interaction partners for selected genomic sites.
So another point to think about. If chromosomal architecture is relevant to gene regulation, diseases stemming from mutations in the genes involved should be known. Well, I can answer that one. Mutations in genes encoding nuclear envelope proteins cause a fascinating array of diseases referred to as nuclear envelopathies or laminopathies. And these affect different tissues and organ systems. Now for the most part, what's really surprising about these diseases is that they tend to be tissue specific. The effects do not manifest until well after birth, and in some cases they only appear during adulthood. So here's an example affecting striated muscle and causing cardiomyopathy. Other cases affect adipose tissue or peripheral nerves. You're probably most familiar with Hutchinson-Gilford progeria syndrome. This happens to be a multi-system disease, and so it affects growth and bone and fat. And all of these features have led to the notion that progeria is a disorder of accelerated or premature aging. It turns out, though, that some tissues that are commonly affected during physiological aging in the later decades of life, such as the brain, are not at all affected in children with this disorder.
So it's a disease of the nuclear lamina, not fully a disease of aging in the brain. Now the mutations that cause these diseases are dispersed, and they may be missense or small in-frame deletions. They may affect splice sites. They may truncate the open reading frame or affect promoters, resulting in decreased levels of transcription of the lamin A gene.
So let's move to talk about the spectrum of sequence variants. In 2007, Science declared human genetic variation the breakthrough scientific discovery of the year. And the availability of high-throughput genotyping technologies, coupled with the results of major polymorphism characterization efforts such as the International HapMap initiative, has made it possible to conduct genome-wide association studies seeking to identify common variants that are statistically linked with particular diseases. To date, hundreds of genome-wide association studies have been performed, with many having identified unequivocal and statistically compelling associations between particular genetic variations and diseases of all sorts. However, there are still unexplained genetic bases for many diseases, and this has led to debate over whether the causative agents of common disease are common or rare variants. The polygenic diseases such as hypertension and cancer and diabetes may gain their genetic heritage from the action of multiple genes as well as multiple variants of those genes, including mutations that occur in regulatory elements.
So just as genome-wide association studies have identified numerous disease-associated genes, other sequencing studies have linked multiple rare variants to human diseases and traits such as type 1 diabetes, blood pressure and obesity. But one important aspect of studying the impact of rare variants is the identification of multiple types of rare variants. So any one of several different variants within the same gene might be carried by the individuals having that disease of interest.
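One way to picture the pooling of rare variants within a gene is a simple carrier count compared between cases and controls. The following is a minimal sketch, not a method described in the talk: the variant IDs and the case and control lists are made up for illustration, and the one-sided Fisher exact test is just one assumed choice of comparison.

```python
from math import comb

def count_carriers(individuals):
    """Count individuals carrying at least one rare variant in the gene."""
    return sum(1 for variants in individuals if variants)

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table
    [[a, b], [c, d]] = [[case carriers, case non-carriers],
                        [control carriers, control non-carriers]].
    Returns P(case carriers >= a) under the hypergeometric null."""
    n_cases = a + b
    n_carriers = a + c
    total = a + b + c + d
    denom = comb(total, n_cases)
    upper = min(n_cases, n_carriers)
    tail = sum(
        comb(n_carriers, k) * comb(total - n_carriers, n_cases - k)
        for k in range(a, upper + 1)
    )
    return tail / denom

# Hypothetical data: each individual is the set of rare variants seen in one gene.
cases = [{"v1"}, {"v2"}, {"v3"}, {"v1", "v4"}, {"v5"}, {"v6"}, {"v7"}, {"v8"}, set(), set()]
controls = [{"v9"}] + [set()] * 9

a = count_carriers(cases)
c = count_carriers(controls)
p_value = fisher_one_sided(a, len(cases) - a, c, len(controls) - c)
print(a, c, p_value)
```

Note that the variants differ from person to person; only the pooled carrier count per gene enters the comparison.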
And this suggests that all of those variants could contribute to the pool that is causative for that disease. Collectively, they need to be seen more often in the disease situation than in the unaffected situation. And they all need to perturb the gene of interest in roughly the same way or to the same amount to induce that disease phenotype. So here we're seeing the functional elements of a gene, in this case coding exons, and the case sequences have many different changes that are disruptive whereas the control sequences do not. What's important for this discussion is that the detection methods are quite different. Rare variants will be detected by genomic sequencing, and common variants can be assessed through genotyping platforms.
So another point to think about. The conclusion that common diseases are multifactorial in origin implies that many, many more disease-associated variants remain to be identified. So there's a big job ahead of us.
So what are those mutations going to look like? You're all familiar with coding mutations because they affect gene function. But what I want to talk about is how they may affect regulation of gene expression. So first, point mutations are single base changes in DNA sequences. They're the most common type of alteration in DNA that has been studied. And they can have varying effects on the resulting protein. So for example, a missense point mutation substitutes one nucleotide for a different one, changing one amino acid but leaving the rest of the code intact. And the impact of these mutations depends on the specific amino acid that has changed and the protein sequence that results. So if the change is critical to the protein's catalytic site or to its folding, damage may be severe. Frameshift insertions or frameshift deletions affect all of the open reading frame downstream. And nonsense mutations are point mutations that change an amino acid codon to one of the three stop codons, which results in premature termination of the protein.
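The categories of single-base coding changes described here can be sketched in a few lines. This is an illustrative sketch only: it assumes the standard genetic code, the function name `classify_substitution` is made up, and the "stop-lost" label for a change at a stop codon is an addition that the talk does not mention.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_substitution(codon, position, new_base):
    """Classify a single-base substitution at `position` (0, 1 or 2) of a codon."""
    codon = codon.upper()
    new_base = new_base.upper()
    if codon[position] == new_base:
        raise ValueError("the new base must differ from the original base")
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "silent"      # synonymous: same amino acid (or still a stop)
    if after == "*":
        return "nonsense"    # amino acid codon becomes a stop codon
    if before == "*":
        return "stop-lost"   # assumption: not a category named in the talk
    return "missense"        # one amino acid replaced by another

print(classify_substitution("GGC", 2, "T"))  # silent (Gly -> Gly)
print(classify_substitution("TGG", 2, "A"))  # nonsense (Trp -> stop)
print(classify_substitution("GAG", 1, "T"))  # missense (Glu -> Val)
```

A classification like this only looks at the protein code, so a change labelled "silent" here can still matter if it disrupts a regulatory signal such as a splicing site.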
Nonsense mutations may be caused by a single base substitution or by frameshift mutations. But think about silent substitutions at synonymous positions. They have the capacity to alter the processing of a protein. So these are seen at two-fold degenerate sites or four-fold degenerate sites, where the change is in the third base, or wobble position, of the codon. And these are seemingly innocuous towards the protein function. But if they affect how much of the protein can be made in the cell, they can be extremely detrimental. So for example, when altering the consensus recognition site for splicing factors, such as a splice donor or splice acceptor site, even synonymous substitutions may debilitate the protein by resulting in exon skipping events. And so these would be occurring right at the edges of the exons. Moreover, mutations that create a sequence that resembles a splice site can cause activation of splicing at a cryptic or ectopic splice site. And these would generate a longer 5' end when they occur as non-coding mutations, or a shortened 3' end when they occur in coding sequences. So these splice sites would be used in preference to the normal functional splice sites.
Now in many human diseases, the genes harbor exonic mutations that affect pre-messenger RNA splicing by altering the recognition sites for exonic splicing regulators. And these can be nonsense, missense, or even silent substitutions that interrupt exonic splicing enhancers, which are bound by SR proteins and typically work to promote inclusion of an exon in a transcript. Moreover, these mutations can create exonic splicing silencer sites. These sites are bound by hnRNP proteins and work to restrict inclusion of an exon in a transcript.
So here's a known example. This is the survival motor neuron 1 gene, and its homozygous loss causes spinal muscular atrophy, a very serious disease.
SMN1 has a paralogue known as SMN2, and a single substitution in this sequence, which is a silent substitution, is actually enough to disrupt binding of a regulatory protein at an exonic splicing enhancer. But it turns out that it's not just the disruption of splicing enhancers, but the creation of splicing silencers that may also be a problem. So in the normal state, the SMN1 exon may not bind a splicing silencer, but the single base change, in what may be an overlapping recognition site, is enough to create an exonic splicing silencer that can be bound by an hnRNP protein. So in this paralogue, the SMN2 gene, the exon that has this silent substitution is actually skipped. And a means of therapeutic intervention might be to prevent the skipping of this exon, and there are certainly published reports of therapeutic approaches that are capable of preventing these exon skipping events. Some of them work by changing the ability of the polymerase to read through a transcript. Others work by using synthetic oligos that block the interaction of the regulatory proteins.
So just another point to think about. The fact that synonymous substitutions in coding sequences could interrupt regulatory processes implies that resequencing projects might be ignoring the most critical variants at this point. It's still very difficult to predict or evaluate splicing mutations. There are a limited number of computational tools available to do this. Work in my group has developed a new web tool that is available for evaluation of sequences that may affect splicing, and this is in press in Genome Biology. This work was done by Adam Woolfe, a postdoc in my group.
And so let's just take a look at what the landscape of an exon looks like in terms of splicing enhancers or silencers. So in green, I'm showing splicing enhancers, and in red, splicing silencers of different strengths. And here's an example from an exon in the SMN1 gene.
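A landscape of this kind can be thought of as labelling every overlapping short window of an exon as enhancer-like, silencer-like, or neither. The sketch below is only an illustration of that idea and not the web tool mentioned above: it assumes six-base windows, and the sequence and the two motif sets are made-up placeholders rather than curated splicing motifs.

```python
def windows(seq, k=6):
    """All overlapping k-base windows of a sequence, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def landscape(seq, enhancers, silencers, k=6):
    """One character per window: E = enhancer motif, S = silencer motif, . = neither."""
    labels = []
    for w in windows(seq, k):
        if w in enhancers:
            labels.append("E")
        elif w in silencers:
            labels.append("S")
        else:
            labels.append(".")
    return "".join(labels)

def windows_covering(pos, seq_len, k=6):
    """Start indices of the windows that include base `pos` (0-based)."""
    first = max(0, pos - k + 1)
    last = min(pos, seq_len - k)
    return list(range(first, last + 1))

# Hypothetical motif sets and exon fragment, for illustration only.
ENHANCERS = {"GAAGAA"}
SILENCERS = {"AATAAG"}
normal = "ACGTGAAGAAGCTT"
mutant = normal[:7] + "T" + normal[8:]   # single G-to-T change at position 7

print(landscape(normal, ENHANCERS, SILENCERS))  # ....E....
print(landscape(mutant, ENHANCERS, SILENCERS))  # .....S...
print(len(windows_covering(7, len(normal))))    # 6 windows contain the changed base
```

Because the windows overlap, one base change touches up to six windows at once, which is how a single substitution can remove an enhancer call and add a silencer call in the same stretch.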
And you see that this is a relatively short exon, less than 100 base pairs. You see that the landscape of this exon shows both splicing enhancers and splicing silencers. And in the normal case, the edge of this exon has enhancers and silencers. But in the mutation, the enhancers have been lost and silencers have been gained. Now these are overlapping hexameric windows, and so a single mutation is resulting in all of the changes that are predicted here. Here's another example from the ATR gene. This is a known example of an exon skipping event. Again, a single base change is enough to cause a region that has multiple candidate splicing enhancer binding sites to change to a region that has fewer enhancer sites and more silencer sites. And here's an example from the ATM gene. In this case, a strong series of splicing enhancers is lost due to this single base pair change.
So we can, again, return to talking about progeria. There are splicing mutations that occur in progeria, and this was work published by Francis Collins and colleagues, where they found that the normal sequence of the lamin A gene can be mutated in patients by a C to T change or a G to A change. And the consequence of either mutation is that this region becomes more similar to a consensus splice donor sequence. So normally, the sequence of lamin A does not look like a consensus splice donor sequence. It differs in these two positions. However, in the first mutation, the change of a C to a T makes this region look more like a splice donor site. And in the second case, the change of a G to an A also makes it look more like a splice donor site, with each of the mutant sequences differing by only one nucleotide from that splicing consensus. The result of either of these mutations is a truncation of exon 11, leading to the defect of the disease known as progeria. So this is a splicing mutation that affects the nuclear lamina, and it is all about the regulation and the production of this gene.
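The comparison to a splice donor consensus can be sketched as a simple mismatch count. This is a minimal illustration under stated assumptions: the nine-base consensus MAG|GTRAGT (M = A or C, R = A or G) is the commonly cited donor consensus rather than something given in the talk, and the three nine-base sequences are illustrative stand-ins chosen to reproduce the counts described (two mismatches normally, one after either change), not verified gene sequence.

```python
# IUPAC codes needed for the assumed donor consensus.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC",   # A or C
    "R": "AG",   # A or G
}

DONOR_CONSENSUS = "MAGGTRAGT"  # last 3 exon bases followed by first 6 intron bases

def mismatches(candidate, consensus=DONOR_CONSENSUS):
    """Number of positions where the candidate disagrees with the consensus."""
    if len(candidate) != len(consensus):
        raise ValueError("candidate and consensus must have the same length")
    return sum(
        1 for base, code in zip(candidate.upper(), consensus)
        if base not in IUPAC[code]
    )

# Illustrative sequences (assumptions, not taken from the lecture slides).
normal = "CAGGTGGGC"
c_to_t = "CAGGTGGGT"   # C-to-T change at the last position
g_to_a = "CAGGTGAGC"   # G-to-A change at the seventh position

for name, seq in [("normal", normal), ("C to T", c_to_t), ("G to A", g_to_a)]:
    print(name, seq, mismatches(seq))
```

A plain mismatch count treats every position as equally important; real splice site scoring weights positions differently, so this is only a way to see why either change moves the sequence closer to a donor site.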
So let's switch gears and talk about the non-coding regulatory landscape. You're all familiar with many functional elements in the genome. For example, insulator elements that serve as the barriers between closed heterochromatin and open active chromatin regions, and insulators that act as enhancer-blocking elements to contain the function of enhancers and redirect it towards the necessary interactions with promoter elements to activate transcription. And silencer elements, which are in these regions of open chromatin but act to keep gene expression repressed. Promoters themselves are deemed the gateways to genomic transcription, and as such they really deserve much more attention for their contributions to variation in gene expression levels across populations or in disease.
You'll notice many types of core promoter elements. These include BREs, TATA elements, initiator elements, MTEs, and DPEs. The MTE is the motif ten element, and the DPE is the downstream promoter element. However, there are no universal core promoter elements. Each of these elements is found in only a fraction of promoters, and the numbers are estimated at from 1% to 30%. Yet the majority of promoters in mammalian genomes contain CpG islands. Up to 70% of all promoters contain CpG islands. Now although it is not a conserved or core promoter element, the CpG island creates a chromatin environment that is conducive to expression at those promoters. Note that many of the core promoter elements fall downstream of the transcription start site, and this places them within the 5' UTR. This was a conclusion that was also supported by results of the ENCODE consortium, where they found that many regulatory elements actually fall downstream of the transcription start site. And therefore the definition of a promoter has been expanded to include sequences both upstream and downstream of the transcription start site. Now different classes of promoters initiate transcription by different approaches.
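For readers who want a concrete handle on what makes a sequence a CpG island, the usual measures are GC content and the ratio of observed to expected CpG dinucleotides. The sketch below is an assumption-laden illustration: the thresholds (length of at least 200 bases, GC fraction above 0.5, observed/expected ratio above 0.6) are commonly cited criteria that the talk itself does not state, and the function names are made up.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence.

    Expected CpG count is (number of C * number of G) / length, so
    observed/expected = (CpG count * length) / (C count * G count).
    """
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")
    gc_fraction = (c_count + g_count) / n if n else 0.0
    if c_count == 0 or g_count == 0:
        return gc_fraction, 0.0
    return gc_fraction, (cpg_count * n) / (c_count * g_count)

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to a candidate sequence."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and ratio > min_ratio

# Toy sequences, for illustration only.
cpg_rich = "CG" * 100
at_rich = "AT" * 100
print(cpg_stats(cpg_rich), looks_like_cpg_island(cpg_rich))
print(cpg_stats(at_rich), looks_like_cpg_island(at_rich))
```

In practice such a test is applied in sliding windows along a promoter region rather than to one fixed sequence.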
There are narrow-peaked or focused promoters that typically contain TATA boxes, and they usually have a single start site. There are broad or dispersed promoters that usually yield multiple weak start sites over a region of 50 to 100 nucleotides. These broad promoters typically contain CpG islands but lack TATA motifs. And interestingly, focused core promoters are more ancient and widespread throughout nature than dispersed core promoters. But in vertebrates, dispersed promoters are more common than focused promoters. Perhaps that means that dispersed promoters are the newer promoters, which may lead into the acquisition of things such as alternative promoters, which I'll talk about shortly.
So gene activation ultimately requires the interplay of enhancers with promoters. And so once the proteins are recruited to the enhancer site, the models for interaction are varied, and they range from facilitated tracking to the most commonly invoked form, looping. There are other versions, such as linking and tracking, where the proteins actually move across the DNA. And we've already talked about the immobilized transcription factory, which has a completely different mechanism than the models I'm showing here.
Now once an enhancer is interacting with its promoter, it functions in a variety of ways. I'll show you three mechanisms here. The first is that transcription factors recruit the nucleosome remodeling complexes to remove a nucleosome from the transcription start site and allow the recruitment of the transcription machinery. Alternatively, transcription factors can recruit co-activator complexes that have histone acetyltransferase activities, such as the protein p300, that modify the histones by acetylation, which then allows access of the transcription machinery and activator proteins. And the third mechanism is by bringing in the Mediator complex, which carries along with it the RNA polymerase II molecule, which then is available for the activation of transcription.
So the dominating, activating role of enhancers requires specialized containment of that activity. And as we've said, the CTCF protein is known to act as an enhancer blocker so that interactions aren't taking place all over the genome. This activity was confirmed in studies using RNAi to block CTCF function, which resulted in the spurious activation of additional promoters. But promoters also contribute to combinatorial regulation. This is a lesser-acknowledged role, and it has to do with the identity of the core promoter itself. So in the vast sea of the genomic landscape, enhancers may selectively interact with only subsets of promoters, such as the INR-DPE combination shown here, in preference to the INR-TATA combination. And in these cases, the role of CTCF becomes more than a passive blockade. It becomes a targeted facilitator of interactions that need precise regulation. So it's this combined effect of promoter identity and genomic landscape, with the accurate placement of CTCF sites, that is necessary for regulated transcription.
Now, I've mentioned alternative promoters. Alternative promoters also use core elements to contribute specificity. So normally, we might predict promoter regions by taking the 5' transcription start site and looking at the upstream sequence, and we might now extend that a little bit downstream. This approach could then be applied to transcription start sites that fall in unusual places. This might be within an exon or upstream of uniquely positioned alternative first exons. And so now we begin to see a greater complexity of gene regulation, where for a single gene, the regulatory elements start to dictate where and when things are expressed. The differences in these core promoter elements play into whether these transcription start sites will be coordinately activated or perhaps activated in a mutually exclusive way. And keep in mind that they would be competing for the same enhancer elements.
So here's another point to think about. The preference for particular enhancer-promoter combinations implies an inherent specificity of interactions that could be used for predictive purposes. We're not quite ready for this yet, I think, in our current state of regulatory genomics, but as a computational person, this is the kind of regulatory study that I like to think about and that I think will be possible in the near future.
So because promoters are positionally constrained by their function, their positions can be estimated fairly accurately in the genome, because we know where the start site of a transcript is. But enhancers have a more diversified presence in the genome. They can be within genes, they can be upstream or downstream of genes, they can even be in gene deserts. Thus their identification requires different approaches. So conservation is one way to identify enhancers. And here is a region that I'll describe between the SALL1 gene and the CHD9 gene, which has a series of ultra-conserved regions. So here's an image that I've cut out from the UCSC Genome Browser just to show you the extensive level of sequence conservation present here across many mammals. And no genes are annotated in this region. So work done to test these regions in transgenic mice showed that some of them had absolutely no activity as enhancers at embryonic day 11.5, when they were tested. But many of them were positive as developmental enhancers, and these activities were very specific for midbrain or neural tube or limb expression. These results also pointed out the modular nature of enhancers, whereby each one targeted a specific expression pattern or reinforced a regulatory control with redundant and equally specific activity. These are called shadow enhancers.
So here's an example of an enhancer that is very well characterized. This is the interferon beta enhancer.
And you can see that there is a series of adjacent transcription factor binding sites interspersed with HMG proteins that bend the DNA, resulting in this looped interaction at the promoter. Now if we look at this in the UCSC Genome Browser, it's interesting to see that there is a strong peak of conservation just upstream of the transcription start site. And this says that this configuration is the same in all mammals, but it's not carried through to all vertebrates. The UCSC browser also has annotations for conserved transcription factor binding sites, and they carry a lot of false positives. They're just predictions. I don't use them all of the time, but in this particular example, it's very nice in that the predicted sites match quite well with what is known for the binding sites at this location. And furthermore, additional predictive tracks such as regulatory potential also indicate the presence of putative functional regions coinciding with the locations of these transcription factor binding sites. So these are the ESPERR regulatory potential scores, and you can see that here's a strong peak that corresponds to the location of the transcription factor binding sites in that locus. Note that there are also peaks in the coding sequence, which happens quite frequently, but there are also peaks downstream of the gene, and these could be undetected regulatory elements in this locus. And so this is just an example of some of the tracks that you could use at the UCSC Genome Browser for looking for the presence of functional elements.
So this analysis of ultra-conserved elements in the SALL1 region yielded a predictive rate of only 50% using conservation, and this was primarily due to the limitations of the assay that they were using. But although it has been the means for identifying most enhancers for the last 10 to 20 years, conservation has facilitated the identification of only a portion of all enhancers in the genome.
And this is because, like promoters, enhancers can undergo diversification. So the IFN beta enhancer represents the extreme form of conservation, and this is known as an enhanceosome, where things are so positionally required that they're unable to change. Contrast that with a more flexible and relaxed configuration of binding sites, known as a billboard type of enhancer. So these enhancers undergo motif turnover that may reposition the same motif somewhere else, or it may actually replace that motif with a different motif that has the same function. When this happens, it's very difficult to use sequence conservation as a guide to find enhancer elements. So in lieu of sequence conservation, other techniques are rising in prominence for the identification of enhancers. For example, the histone acetyltransferase protein p300 is present at enhancer elements. And in this particular study, ChIP-seq experiments were used to identify p300 present in regions that were isolated from the forebrain and that corresponded to expression in transgenic mouse experiments. And so you see the peaks at the genomic positions that were pulled down in ChIP-seq experiments, and those regions were cloned and tested in transgenic mice. A similar result came from ChIP-seq isolates from the midbrain, which when cloned gave expression in the midbrain, and from sites from the limb as well. And so notably, each of these sites has sequence conservation, and that is a typical characteristic of developmental enhancers. But what this study showed was that the presence of p300 could be used to identify enhancer elements regardless of the underlying sequence and regardless of sequence conservation at that location. So some would argue that the difference between humans and chimps is not due to coding sequences or regulatory changes. I would actually say it's not due to nutritional differences, but perhaps it's due to epigenetics.
And so epigenetics commonly refers to the study of mitotically and meiotically heritable changes in gene function that are not attributable to changes in DNA sequence. An epigenome is the representation of all epigenetic phenomena across the genome, but keep in mind that this is within a single cell type. It changes for every cell type. So the two main components of the epigenetic code are DNA methylation, which happens at CpG dinucleotide positions, and histone modifications, which happen at the protruding tails of the histones in the nucleosomes. Now it's currently debated whether the chromatin modifications happening at histones also belong to the class of epigenetic modifications, and this is due to the lack of evidence that these changes are heritable. However, these changes are still very important, given that many of them have been shown to play important functional roles in genomic regulation, and that histone modification and DNA methylation may be related. Their importance is unquestioned even if their label as true epigenetic marks is uncertain. So it has been proposed that distinct histone modifications on one or more tails of the nucleosome act sequentially or in combination to form a histone code that is read by proteins to bring about distinct downstream regulatory events. For example, acetylation and methylation of histone H3 can occur at lysine and arginine residues, and the functional consequences depend on the type of modification and the specific site that it occupies. So I'm showing here methylation that's occurring at histone H3 lysine 4, lysine 9, lysine 27, and lysine 36, and I'll go into more details in the next few slides. Modification of histone H4 can also occur at lysine and arginine residues. Again, acetylation is an activation mark, whereas methylation of H4K20 is associated with transcriptional repression. So distinctive marks can be used to differentiate active promoters from inactive promoters.
So here the transcriptional activation coincides with marks of H3K4 mono-, di-, and tri-methylation, and you'll notice that these marks are spread out quite a bit from the transcription initiation site. Monomethylation is occurring at H3K9, H3K27, H4K20, and H2BK5, and this monomethylation mark is found throughout the transcribed region. H3K79 di- and tri-methylation occurs not only at the transcription start site but also downstream. H3K36 tri-methylation is happening towards the end of the gene. The histone variant H2A.Z is present at the transcription start site, and acetylation of H3K9, K18, and H2BK12 are all happening at the transcription start site, whereas H4K12 acetylation and K16 acetylation happen throughout the body of the gene. So you can see that there are many different marks going on. I've only shown you a subset. It is suggested that there are over 150 different marks on the histones that could play combinatorially into how they regulate gene expression. Histone acetylation marks can also correspond to levels of gene expression. So in this case, in red, you see that the normalized counts of tags for H3K9 acetylation in ChIP-seq experiments are highest for genes with the highest level of expression, whereas those that are silent, shown in purple here, have very few tags. This mark happens to peak around the transcription start site and then is not present in the body of the gene. This is consistent with the action of the histone acetyltransferase p300, which interacts with the initiating Pol II molecule and then dissociates. Now here's an example of marks for H3K27 acetylation. Again, you see the same pattern, where genes with the highest levels of expression have more marks than genes that are silent. In this particular case, there is a peak at the transcription start site and the marks continue throughout the region of the transcribed unit.
This is consistent with the action of the histone acetyltransferase PCAF, which associates with the elongating form of RNA Pol II. In this example, H4K12 acetylation again shows the correspondence with higher levels of transcript expression, peaks at the transcription start site, and remains elevated throughout the transcriptional unit. This could be the work of both p300 and PCAF together. Again, as we learn these rules, we begin to move toward a systems biology approach where we could use this information to predict which histone acetyltransferases are playing a role in marking the chromatin, and could ultimately tie that together with perhaps mutations that affect the function of these particular genes. Now, I mentioned combinatorial patterns of histone modifications that happen at promoters. This is a study describing 17 different patterns that were present at promoters, where 3,200 promoters were examined. So in yellow, the chart is showing that each of these promoters contained these modifications, and any that lacked one of these modifications are marked in blue. And you see that the loss of any one of these modifications reduced the dataset by almost 90%. So there is a lot going on at these promoter regions, a lot of different combinations, each of which has meaningful input. Now, just as active genes are subject to repression, histone modifications are reversible events. So histone acetyltransferase molecules provide activity that coincides with gene activation, and histone deacetylases lead to the compaction of chromatin and subsequent silencing of gene expression. And we'll talk more about what's going on in this silent state. So inactive promoters carry histone modifications as well. These include mono-, di-, and tri-methylation of H3K4 (me1, me2, and me3), and H3K9 di- and tri-methylation throughout the locus. And so the signals represented by histone modifications are recognized by different nuclear factors containing specialized structural domains.
These include chromodomains, which are found in the proteins Polycomb and HP1, and these recognize tri-methylation events on H3K9 or H3K27 that correspond to a silent state of gene transcription. Chromodomains also recognize activation marks, such as H3K4 tri-methylation, as do proteins that contain PHD fingers. Acetylation marks are almost always associated with transcriptional activation, and in this case, the protein that translates that modification into some function is a bromodomain-containing protein. So just something to think about. The fact that every cell type has a unique pattern of histone modifications attributable to the functioning of that cell implies that changes in those patterns could reveal disease processes. There are a number of stages that are necessary before we get to that point. First, the foundational pattern for each cell type needs to be determined, and from that point, we can look for changes that would reveal disease processes. So it turns out that activating and repressive marks are not mutually exclusive. Their presence is combined in ES cells to prepare a region for its future differentiation pathway. So here I'm showing the marks for K4 methylation as activation marks, whereas K27 methylation represents repressive marks, and these are necessary and present in genes that may be off or have low-level expression in ES cells. Upon differentiation of those cells, these bivalent modifications become resolved into one of three situations: one where there are only activating marks and the gene is on; one where there remains a sustained bivalent modification, in which case those genes may continue to be expressed at very low levels; or one in which only silencing marks are present, and in those cases the genes are off. So in this central scenario, these genes may be waiting for a further time point when they are expressed and their transcription is important to the function of the cell.
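As a rough illustration of the bivalent-mark logic just described, here is a minimal Python sketch. It assumes that the presence or absence of H3K4 and H3K27 methylation at a promoter has already been reduced to a yes/no call (for example by peak calling on ChIP-seq data, which is not shown). The function names, the labels, and the gene names are hypothetical and are not taken from any study mentioned in the talk.

```python
def classify_marks(has_k4_methylation, has_k27_methylation):
    """Label a promoter by which of the two marks it carries.

    The labels describe the marks only. They are not expression calls,
    because the marks may correspond to expression without determining it.
    """
    if has_k4_methylation and has_k27_methylation:
        return "bivalent"
    if has_k4_methylation:
        return "activating-only"
    if has_k27_methylation:
        return "repressive-only"
    return "unmarked"


def describe_resolution(es_marks, differentiated_marks):
    """Describe how a promoter's state changed upon differentiation.

    Each argument is a (has_k4_methylation, has_k27_methylation) tuple.
    """
    before = classify_marks(*es_marks)
    after = classify_marks(*differentiated_marks)
    if before == "bivalent" and after == "bivalent":
        return "bivalent sustained"
    if before == "bivalent":
        return "bivalent resolved to " + after
    return before + " to " + after


if __name__ == "__main__":
    # Hypothetical promoters: (marks in ES cells, marks after differentiation)
    promoters = {
        "geneA": ((True, True), (True, False)),
        "geneB": ((True, True), (True, True)),
        "geneC": ((True, True), (False, True)),
    }
    for name, (es, diff) in promoters.items():
        print(name, describe_resolution(es, diff))
```

The three hypothetical genes correspond to the three outcomes on the slide: activating marks only, a sustained bivalent state, and silencing marks only.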
So just to summarize histone modifications, it turns out that H3K27me3 appears to be dominant, because all patterns containing this modification tend to be repressive. Also, the H3K4 tri-methylation modification alone is not sufficient to support active transcription, because genes associated with this mark alone tend to be silenced. And thirdly, histone modification patterns alone do not determine expression levels. They may correspond to them, but they do not determine them, as genes associated with many patterns show an extremely broad range of expression, from silent to active. So here's just a screenshot of the UCSC Genome Browser, where you might be able to find some of this data if you're not producing it yourself in your own lab. Under the track group for regulation, you'll find many, many data sets, including the Barski et al. ChIP-seq data. These are histone methylation marks on the genome. Broad has histone modifications as well, and I'll direct your attention to open chromatin. This is the DNase I hypersensitivity data that's being produced by Greg Crawford's lab at Duke University. But there are many other data sets available that carry histone modifications or ChIP-seq data for particular transcription factor binding sites, and I encourage you to look through those and see if any of that is useful for your own work. Keep in mind that these were produced on particular cell types, so the modifications you find in these tracks aren't necessarily representative of what would be happening in the cell type that you're studying. But typically, if a mark is there, it gives an indication that a functional element is also present. So this is another way to screen through the genome looking for functional elements. And that's the point this next slide makes. So I'd like to move on now and talk about epigenetics involving DNA methylation. So just recently, Time magazine published this cover, "Why Your DNA Isn't Your Destiny."
And it turns out that the heritability of traits is not just what is encoded by your genes, but how these genes are presented to the cell. So DNA methylation is precisely patterned in each cell type, and these patterns differ even between infant and adult for the same cell type. So here you see a heat map where unmethylated CpG dinucleotides are shown in yellow, and methylated sites are shown in blue. And you see that there are differences in these patterns, even between infants and adults. These patterns also differ across tissues of the same individual. What this says is that when looking at DNA methylation in a diseased tissue, blood is not always the best comparison to use. So newer techniques for detecting DNA methylation incorporate the gold standard, known as bisulfite sequencing. This is a chemical conversion of the DNA that can then be analyzed using microarray analyses or high-throughput sequencing. DNA methylation occurs at CpG dinucleotides, and anywhere in the genome this can be detected using the bisulfite sequencing assay. So the steps for this include fragmenting the DNA, annealing primers, and performing bisulfite conversion, which then takes positions that were CpGs and makes them into Us followed by Gs, except at the positions where there was a methyl group. Those are protected from this conversion, and they remain CpG positions. Following this conversion are PCR amplification and detection using microarray analyses or high-throughput sequencing. So there are still some challenges to doing this technique without PCR amplification, but the next realm of studying DNA methylation will be to take these sequences and put them on a next-gen sequencing machine. So as I mentioned early on, there are relationships between histone modification and DNA methylation, and so in this way unmethylated, transcriptionally active DNA can become methylated through the action of a DNA methyltransferase, making it transcriptionally inactive.
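To make the bisulfite conversion steps described a moment ago more concrete, here is a small in silico sketch in Python. It is a deliberately simplified illustration, not a real analysis pipeline: it assumes a single DNA strand, complete conversion of every unmethylated C, and no sequencing errors, and it writes each converted base as T because the U produced by bisulfite treatment is read as T after PCR amplification. The function names and the example sequence are made up for illustration.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite conversion followed by PCR on one strand.

    Unmethylated C is converted (C -> U, read as T after PCR).
    A C at an index listed in methylated_positions is protected.
    """
    converted = []
    for index, base in enumerate(sequence):
        if base == "C" and index not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_cpg_methylation(reference, converted_read):
    """Compare a converted read to the reference at every CpG.

    Returns {index of the C in each reference CpG: True if methylated}.
    Assumes the read is aligned base-for-base with the reference.
    """
    calls = {}
    for index in range(len(reference) - 1):
        if reference[index:index + 2] == "CG":
            if converted_read[index] == "C":
                calls[index] = True      # protected, so it was methylated
            elif converted_read[index] == "T":
                calls[index] = False     # converted, so it was unmethylated
    return calls


if __name__ == "__main__":
    reference = "ACGTCGACCG"   # made-up sequence with CpGs at indices 1, 4, 8
    methylated = {1}           # pretend only the first CpG is methylated
    read = bisulfite_convert(reference, methylated)
    print(read)                                   # ACGTTGATTG
    print(call_cpg_methylation(reference, read))  # {1: True, 4: False, 8: False}
```

Real bisulfite data need strand-aware alignment and must account for incomplete conversion, which this sketch ignores.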
The recruitment then of histone deacetylase molecules or methyl-binding domain proteins coincides with that inactive transcription, expelling transcription factors and creating histone deacetylation and chromatin condensation, and the ultimate result is stable transcriptional repression. Now, general DNA hypomethylation was the first epigenetic alteration identified in cancer cells. If we look at the landscape of the genome in terms of methylation, typically CpG islands are hypomethylated, especially those falling upstream of tumor suppressor genes that are actively transcribed in a cell. Other regions of the genome are hypermethylated. These are the pericentromeric regions, areas of heterochromatin, and repetitive DNA elements. But under some unusual conditions, these regions may undergo hypomethylation. The removal of the methylation allows transcription to become active, and the sequence similarity of these repetitive elements across the genome renders them susceptible to mitotic recombination and genomic instability. Furthermore, at the same time, it appears that CpG dinucleotides in the promoters of tumor suppressor genes undergo hypermethylation. This turns off their transcription, creating a loss of function for that tumor suppressor gene in that cell, and these are both hallmarks that are found in cancer cells. So histone modifications and DNA methylation make up an integrated biological system within the cell. CpG island hypermethylation results in a decrease in tumor suppressor gene transcription. This may correspond also to a decrease in microRNA expression, because microRNAs are often embedded within those transcriptional units, which may then result in an increase in the oncogenic potential of genes, because they are no longer regulated by the particular microRNAs that keep them in check. Now, these events actually parallel what you might see from genetic mutations that alter the protein-coding capacity in a cell.
Concurrently, then, global hypomethylation makes all of these repetitive elements in the genome now active and accessible, maybe not all of them, but many more than were previously active, resulting in chromosomal recombination, and these changes reflect what is seen during conditions where chromosomal translocations have been acquired in non-cancer cells. The field of DNA methylation in the human genome is in its very early stages, and so there are still basic questions that remain to be answered. These include: Why do CpG islands become methylated in cancer? Why do certain CpG islands become methylated while others do not? Is aberrant hypermethylation a targeted or a random process? So it's this mosaic of epigenetic and genetic alterations that composes a sort of jigsaw puzzle that needs to be solved in two steps. First, it is necessary to obtain the complete description of the type and genomic location of these alterations, and second, the hierarchical relationships between the different types of alterations have to be identified in order to distinguish the extent of their impact. Now, as Eric Green mentioned early on in his talk, genomic medicine is healthcare that is tailored to the individual based on their genomic information. The sequence of the human genome has been solved, but we have not yet realized the full implementation of genomic medicine. And it turns out that emerging techniques, such as targeted resequencing, whole-genome resequencing, and high-throughput genotyping, are going to be necessary to play into this, as well as the efforts of many international consortia, such as the HapMap Project, the ENCODE Project, the 1000 Genomes Project, The Cancer Genome Atlas, small molecule work being done by NCGC, and of course, interdisciplinary research. In this way, we will be able to characterize, diagnose, and treat personal mutations, thus fulfilling the promise of genomic medicine.
So let me just summarize by reviewing these discussion points. We talked about nuclear architecture and visited the ways that it plays a role in gene expression. We also talked about the spectrum of genomic mutation, making the point that there are many variants out there that have yet to be discovered. We talked about regulatory elements, and it's important to be able to characterize these regulatory elements so that when mutations are identified in non-coding regions, it is easier to predict what types of functions they are affecting. We talked about epigenetic modifications. These are layered on top of the genomic sequence. Whereas the sequence tells us everything that is present in the genome, these modifications tell us what is happening in the genome. And we also talked about DNA methylation in cancer. This is a very important aspect of the future of cancer research. So with that, let me just end by saying the landscape of regulatory and epigenetic features is slowly revealing itself as a powerful force in shaping the genome and its regulation. And I hope that I've given you a lot to think about today. Thank you.