Welcome to the MOOC course on Introduction to Proteogenomics. In the last few lectures, Dr. Kelly Ruggles has given you detailed insights into the genomic revolution and into studying the genome and transcriptome. Today she is going to talk about the epigenome. Epigenomics deals with modifications in the expression and function of genes due to the addition of different functional groups. This is Dr. Ruggles' fourth lecture; she will talk about epigenomics and the use of ChIP-seq technology to perform epigenomic studies. Various publicly available resources such as CPTAC, The Cancer Genome Atlas (TCGA), and ENCODE will be described. She will also talk about DNase-seq, where one can sequence genes by using DNase and then compare them with the reference genome. The lecture will also shed light on DNA methylation and whole genome bisulfite sequencing, known as WGBS, which can help in finding sites of methylation. However, because of the high cost and inefficiency of WGBS, reduced representation bisulfite sequencing is often preferred. Dr. Ruggles will further talk about the role of epigenetics in histone modifications leading to the expression of different genes. She will also cover Hi-C, which helps in understanding the interaction and folding of chromosomes with neighboring chromosomes or within themselves. So, let us welcome Dr. Kelly Ruggles for her lecture today.

What I will talk about is epigenomics. There are lots and lots of methods for epigenomics. I have collaborators who come to me with more and more of these methods that are slightly different and have new names, but I am just going to talk about some basic ones. If you go into depth with this, you will read about more and more of them; there seem to be hundreds, but they are all slightly different and are all doing similar things.
So, there is ChIP-seq, which we are going to talk about in a lot of detail, and which is really trying to identify DNA-associated proteins; DNase-seq, which identifies active genes based on DNase hypersensitivity; Hi-C, which cross-links DNA to understand how certain parts of the DNA interact with each other, because the DNA is folded on itself in the nucleus; and bisulfite sequencing, which looks at methylation status. I will talk about each of those in a little more detail.

ChIP-seq combines sequencing with chromatin immunoprecipitation. You have to pick proteins of interest: let us say you are interested in a certain transcription factor, or in a modified histone. For any protein you are interested in that interacts with the DNA, you can pull down that protein and then look at what sequence is associated with it; that is essentially the theory behind this. So, you extract DNA from your nucleus, and when you extract the DNA there are proteins bound to it. You cross-link them with formaldehyde so that they stay bound to the DNA, and then you do the immunoprecipitation, meaning you have an antibody against whatever protein you care about. You pull down not all, but some, of that protein in your sample, still attached to its DNA sequence. Then you take the DNA fragments that were attached to that protein, sequence them, and align them to the genome, so you know which sequences your protein of interest is interacting with. This is typical: here CTCF would be the protein that you pull down, and you can see that there are a lot of reads in this area here. You could look at RNA polymerase, or at a methylation status, so you can look at a whole bunch of different things: anything whose sequence you can pull out after
you do this pull-down, you can look at with this kind of method. Any questions?

DNase-seq is a way of looking at gene activation. It has been shown that regions of the genome that are hypersensitive to DNase are active. So what you can do is cleave the DNA with DNase, look at the sites where it is cleaved, and then take the chunks that are closest to each cleavage site and sequence just those. That way you can see which regions of the genome are sensitive to DNase, because they were able to be cleaved and then pulled down using these Dynabeads.

In addition to that, you can again look at the structure of the chromatin. Genes that are packed within heterochromatin are not expressed. There are lots of different modifications that occur on chromatin to open it up and make it available for expression, and there are lots of ways of measuring openness and activity based on understanding this chromatin structure. For example, histone acetylation: if a histone is acetylated, it loosens up the structure and allows for some transcription. So you can measure levels of this, or you can measure the addition of methyl groups, and we'll talk a little about how you do that and what it means.

DNA methylation is, I think, one of the more commonly used epigenomics methods, just because it's one of the easier ones when you're doing a whole-genome analysis. I'm not going to show a slide on this, but there are also methylation arrays, similar to the arrays we talked about, where you can measure specific patterns of methylation on the genome: you have a chip, you put your DNA over it, and then you can measure levels of methylation in different areas. It's been shown that adding methylation to the DNA actually reduces transcription, so if
you have methylation, the transcription is lower. That's a generalization; this is a complex system, and we'll talk a little about that, but that's the idea. So by measuring methylation you can measure how active certain genes are.

Besides the methylation chip, there is also whole genome bisulfite sequencing, or WGBS, where you bisulfite-treat your DNA and it converts all of the unmethylated cytosines to uracil, which are then read as T when you sequence. You can deal with this on the informatics side: you know that every time there is an unmethylated C, it has turned into T. Then you have a control where you don't do the bisulfite treatment. So you do whole genome sequencing of your bisulfite-treated and your non-bisulfite-treated sample, and then you can see which cytosines were methylated and which weren't. This is just super expensive, and not a lot of people do it, because it's essentially whole genome sequencing times two: you have to have a bisulfite and a non-bisulfite sequenced sample. So it's one way of doing it, but again, it's really expensive.

So they've come up with a different approach, called reduced representation bisulfite sequencing, or RRBS, which combines that method with restriction enzymes so you can enrich for specific sites that have more of this methylation occurring, the CpG sites, using a specific restriction enzyme. It's a methylation-sensitive restriction enzyme, meaning that it will cut if there's methylation and not cut if there's not, so it allows you to get rid of a whole bunch of stuff you don't care about. It takes your whole genome sequence and brings it down to only one percent, keeping the things that you do care about. This is a more common method. I have worked with this data, and it's a bit of a mess, to be honest, but it's science and that's what we
do. So it is a method that is available as well. I'll include this here; I'm not going to go through all of it, but these are different modifications and how they have been shown to be involved in changing the expression of genes. There's a whole bunch of different methylation modifications and histone modifications that can occur.

Okay, so the last thing I'm going to talk about for epigenetics, I think, is Hi-C. This is studying the 3D structure: how the chromosomes are folded on themselves and with other chromosomes. Here is a figure I took from this paper that shows models of the architecture of the chromosomes within the nucleus; each color is a different chromosome. You can see they're folded on themselves and folded with other chromosomes, and how they interact with each other is something that has become of great interest to a lot of people.

We measure this using chromatin conformation capture, or Hi-C, which allows you to look at how these chromosomes are folded on each other. What happens is that you cross-link your DNA and then fragment it using restriction enzymes. So you have your DNA folded on itself, you cross-link it so it's stuck together, and then you cut it up, so you have small pieces that are stuck together; that's here. After you cut with the restriction enzyme, you cap the ends and ligate them together: you take the two pieces that were interacting and ligate them so they're circular. Then you shear it into pieces and sequence them. You end up getting an area here, this piece that you ligated, where two different parts of the DNA, which could have come from totally different chromosomes, are now ligated together. So you're really just interested in where this
ligation occurred: areas where things that it doesn't make sense to find connected are connected. Then you do the sequencing, you can see what looks like it is connected together, and you get things that look like this. This is chromosome 14 and how it's folded on itself, and you can see different patterns of where certain parts of the chromosome are folded on themselves. This is just using different restriction enzymes on chromosome 14 and looking at the ligation patterns and how these areas here are connected throughout the chromosome.

Okay, so the last thing I wanted to touch on was the publicly available data sets. The first is the TCGA. Has anyone heard of the TCGA? Some people. You're going to hear a lot about it this week. The TCGA is a collaboration between the National Cancer Institute and the National Human Genome Research Institute, and what they have done is generate genomic maps of 33 tumor types. Then there's ENCODE, which is another one of these large consortia, and which has really looked at epigenetics: the parts list of all of the functional elements. They've done a lot of the epigenetic work, mostly in cell lines.

The TCGA has been working for many, many years and has published a lot of papers that characterize specific tumor types: characterizing breast tumors, characterizing ovarian tumors, looking at which mutations occur a lot and what subtypes exist in those tumors. Recently, in the last few months, the Pan-Cancer Atlas was published, where they took all of the tumors, no matter what cancer they came from, and analyzed them all together. So this is 11,000 tumors, and they had a lot of the different sequencing levels that we've discussed: SNPs, copy number, RNA-seq, DNA methylation. And, if you're interested, I included it here: there's a
collection of 27 papers that go through all of the many sub-projects that came out of this, so if you're interested you can take a look at that here. There's a lot of information in these papers.

With that, I also wanted to introduce the Clinical Proteomic Tumor Analysis Consortium, or CPTAC. Many of the speakers who are here, including myself, have been involved in CPTAC for many years. The goal of CPTAC is to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis. So we're really trying to take what the TCGA did and make it a proteogenomic analysis of cancer, and to figure out what more we can gather from proteomics that we weren't able to understand using genomics. You can see here that there are a lot of people involved in this. Our meetings are giant, and we always have to take a picture, which is good because it's good for these talks.

The last iteration of CPTAC really just took tumors from the TCGA and looked at them at the protein level. In this we looked at breast, ovarian and colorectal; these papers have been published already. Everyone except for Bing, who's not here yet, worked on the breast analysis, so you're going to hear a lot about the breast CPTAC data, and you'll be very familiar with it. There were about a hundred samples per tumor type. We took a subset of the same samples that the TCGA had done, did proteomic analysis on them, and then did some integration of the data types, which you'll hear about from many of the people today.

In the next iteration of CPTAC, which is ongoing, we are looking at prospective tumors, meaning tumors that were not collected by TCGA; they're being collected specifically for CPTAC. There are nine tumor types in total, and three of them are repeats, so they're prospective samples from breast, ovarian and colorectal, and then
there are about 100 samples that we are collecting for these other tumor types as well, again doing the same types of analysis and trying to better understand cancer by integrating all of these different methods.

Yeah, the blues are the ones that we did the last time and that we're repeating with prospective samples, but that we've already kind of looked at, and the reds are the new ones, which are totally new.

What about the homogeneity of all these types of cancers? The question raised many times is that the cancer itself is very heterogeneous.

Yeah, that's a problem we talk about a lot. The pathologists will look and see how heterogeneous the samples are; they can actually predict how much tumor versus non-tumor we have, and we do try to account for that. We don't do single-cell RNA-seq on these; it's just out of the scope.

It's also written that you tried for normals wherever...

Yeah, we did try for matched normals, which has been complicated. That is the goal, and for some tumors it's easier to get matched normals than for others.

It's harder to find matched normals for certain things like breast tumors, right? Because breast tissue is very fatty, matched normals in breast are harder than in lung, for example.

Yes. So it's a challenge, but the goal is to have matched normals, because it would be better to have a normal to compare our cancer to.

So we're going to look at the subset of the TCGA breast cancer study that was used for the CPTAC study. I just wanted to introduce the breast TCGA study. This was a study that was published, I think, in 2012, where they looked at 825 breast tumors. They did exome sequencing, DNA copy number, methylation, and mRNA and microRNA expression, and then subtyped these breast tumors based on the PAM50 model, which is a typical way of subtyping breast tumors. I'll talk a little bit about what those subtypes are, because you'll probably hear about them a lot
this week. They then identified genes with somatic mutations in different samples. So they were really just characterizing breast tumors from a genomic perspective, looking at the epigenetic drivers.

Then, last year, we published a paper that looked at a subset of these tumors, 77 of them, at the protein and phosphoprotein level, and this is the paper. It really looked at how the DNA mutations and protein signaling were connected. We looked at druggable kinases in a patient-specific manner, and the aim was to provide a resource for the community, just like the TCGA did, but now adding proteomics and phosphoproteomics to it.

For this hands-on, we're just going to look at the genomics data, since that was what I was tasked with. We're not going to do a proteomics analysis yet; we're just going to look at some of the TCGA-based genomics data from the 77 samples that we looked at. I just wanted to mention that PAM50 subtyping was done for the tumors. There are four different subtypes that you'll hear about: luminal A, luminal B, HER2 and basal-like. They all differ in terms of prognosis and in terms of how these patients are actually treated. So the tumors have been subtyped, and you'll see that throughout many of the figures. The basal-like tumors typically have the worst prognosis, so that's something that we tend to focus on quite a bit.

Okay, so this is Figure 1B from the Mertins et al. paper, and the objective for this hands-on is to have you explore a lot of the genomics data that was used in this paper using a publicly available website, cBioPortal. Has anyone used cBioPortal before? Okay, great, so you'll hopefully enjoy it.

I just wanted to introduce cBioPortal. It was actually developed by Memorial Sloan Kettering, and it's hosted there in their molecular oncology center
at this point. There are a lot of people working on it, and they've done a really good job of hosting large amounts of cancer genomics data. They've taken the TCGA, in addition to some other data sets, and made it readily available on the website, so you don't have to download anything. We didn't want you to have to download lots of enormous files, so you can just play with the data and explore it on the website.

So, in conclusion, I hope today you have learned what epigenetics is and the various methods that can be used for analyzing DNA methylation in a given gene. We also learned about high-throughput approaches like WGBS and RRBS, which can help in searching for and analyzing DNA methylation in a gene. We also learned that chromatin conformation capture, Hi-C, may help in understanding the folding and interactions of chromosomes with adjacent chromosomes or with themselves, using different sets of restriction enzymes. Dr. Ruggles also briefly described TCGA and ENCODE, which are publicly available resources containing very useful genomic and epigenetic information. In the next lecture, Dr. Ruggles will give a hands-on demonstration of how to use cBioPortal for accessing gene mutations and expression in published data sets. I hope these lectures are giving you not only an understanding of how to analyze the genome, transcriptome and epigenome, but also information about the various repositories, databases and software tools that are freely available and can be utilized for your own research. Let us continue this discussion about genomic technologies in the next lecture, and then we will have another transition in the concepts, with another speaker to give you more fundamental concepts. Thank you.
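
For anyone who wants to try the bisulfite idea from this lecture on paper, here is a small Python sketch of the informatics logic: unmethylated C is read as T after treatment, methylated C stays C, and the fraction of reads still showing C at a reference cytosine estimates its methylation level. This is only a toy illustration under strong assumptions, not the pipeline used by any study mentioned here. The function names `bisulfite_convert` and `call_methylation`, the 12-base reference and the simulated reads are all hypothetical, and the reads are assumed to be full-length, forward-strand, already aligned and free of sequencing errors.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment followed by sequencing of one molecule.

    Unmethylated C is converted (via uracil) and read as T;
    a C at a methylated position is protected and stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, treated_reads):
    """Estimate the methylation level at every C of the reference.

    Assumes each read covers the whole reference with no indels,
    so read[i] lines up with reference[i] (a toy simplification).
    Returns {position: fraction of informative reads that still show C}.
    """
    levels = {}
    for i, ref_base in enumerate(reference):
        if ref_base != "C":
            continue
        c_count = sum(1 for read in treated_reads if read[i] == "C")
        t_count = sum(1 for read in treated_reads if read[i] == "T")
        if c_count + t_count > 0:
            levels[i] = c_count / (c_count + t_count)
    return levels


# Hypothetical reference; its cytosines sit at positions 1, 4, 5 and 9.
reference = "ACGTCCGATCGA"

# Four simulated molecules with different methylated cytosines.
reads = [
    bisulfite_convert(reference, {1, 5}),
    bisulfite_convert(reference, {1}),
    bisulfite_convert(reference, {1, 5}),
    bisulfite_convert(reference, {1, 9}),
]

levels = call_methylation(reference, reads)
for position, level in sorted(levels.items()):
    print(f"C at position {position}: {level:.0%} methylated")
```

Real data also need the reverse strand (where the conversion shows up as G read as A), a bisulfite-aware aligner, and a check of conversion efficiency, which is part of why this kind of data can be messy to work with.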