So, the way we've designed this workshop, we have a series of modules, as you know, and we're going to kick off with a module around ChIP sequencing and analysis, which is what I'm going to cover, at least the introductory part. We're going to follow that up with a module on whole genome bisulfite sequencing, and then finally integrative analysis. So I'll get right into it. What I didn't say in my introduction is my research interest, which is around epigenetics and malignancies. My lab focuses in particular on cancers that are driven by genetic lesions in epigenetic modifiers: for example, synovial sarcoma, malignant rhabdoid tumors, leukemias, those types of malignancies. And of course, we use epigenetic tools to study them. I've also been part of the epigenomics community for some time. I currently chair the International Human Epigenome Consortium, some of you might know of that work, and I was part of the NIH Roadmap Project that started back in 2008; this is a screenshot from some of the work that emerged from that. So that's a little background on my research interests. What I'm going to try to get across in this module: we're going to do a really brief history of epigenetics. I know there's quite a wide variety of folks in the room, some of you just starting your careers, so I'll give you a little background and then try to bring you up to speed on our current understanding of epigenetic mechanisms. Then I'm going to go into some of the basic molecular biology driving massively parallel sequencing, remembering that epigenomic research has really been enabled by sequencing technologies; prior to NGS, epigenetics was largely limited to some very focused array-based studies. So from my perspective, understanding the molecular biology, as well as the computational side, is essential to doing this research. The output of the molecular biology is the FASTQ file. How many of you know what a FASTQ file is? Okay, good, most of you do. We'll go through that real quick. That, of course, is the output of a sequencing instrument, and it goes into the downstream analysis. I'm then going to go through some challenges and some of the underlying principles of ChIP sequencing, and these are applicable both to histone modifications and, for those of you who are interested, transcription factors; I heard CTCF mentioned a number of times. Then I'll go through a very high-level overview of the analysis workflow, and in module two we'll get into more detail on the alignment. So, getting going here: a brief history of epigenetics. Epigenetics has actually been around for almost 100 years. This is the first description of epigenetic signatures within chromatin, a study published in a plant journal, looking essentially at stains of chromatin, in this case in moss, not in moose. That's my one joke. Good, all right. So we can divide chromatin into essentially two states. We can divide it into what's known as euchromatin; you've heard of this before. That simply means the regions that don't take up dye: euchromatin, open chromatin. And then heterochromatin, or "other" chromatin, is the area that is compacted and takes up dye. So we divide chromatin into euchromatin and heterochromatin. That was really the first description.
From there, studies in flies started to discover that, in fact, heterochromatin, these regions of the genome, are actually not static; they're dynamic. And this was studied, again, in a model system, where through translocation a gene was moved close to the heterochromatin-euchromatin border. And what was found is that, in a non-Mendelian way, this gene was silenced. Many years later, we discovered this was actually due to the heterochromatin itself spreading into the gene. So heterochromatin and euchromatin exist, but they're also dynamic, and they can move around. A few years later, in 1942, and everyone should be familiar with Conrad Waddington, we get his famous paper, where he first described the concept of the epigenotype, or epigenetics. He was using this as a way to describe the process by which genotype gives rise to phenotype, in the context of fly wing development. The journal it was actually published in was called Endeavour. Okay, so let's remember that all of this was discussed without even knowing what the hereditary agent was. It wasn't until a couple of years later that the classic study looking at S and R forms showed that DNA is in fact the hereditary material. Everybody should be very familiar with this. So now we've set up our understanding: we understand there are euchromatic and heterochromatic regions, and we understand that DNA is the hereditary material. Around that time, people were trying to divide DNA into functional units, and chromatography was one way to do it; this is paper chromatography. When this was done, what you can see here is that the various forms of nucleic acids were separated, and a form appeared which was at the time labeled "epicytosine." But if you actually go into the article, and this was published just after we understood that DNA was the hereditary material, it was postulated to actually be methylcytosine, based on some previous work on the tubercle bacillus. So now we understand that there are, in fact, modifications to DNA, and that they are a significant fraction of the nucleic acid content of the nucleus. Around that time, or three years later, I guess, very famously, and many of you should know Barbara McClintock, she discovered that discrete elements in the genome have regulatory potential. Again using model systems, she could show that these regions would bounce around, these are transposable elements, and that they had discrete regulatory function. And many of you know that one of the key findings of our recent work in epigenomics is the definition of these regulatory elements, whether they be CTCF binding sites, enhancers, CpG islands, the elements that actually regulate the genome. So I would argue that Barbara McClintock was really the first to describe these in any way. Around this time as well, Mary Lyon described the process of X inactivation, in a single-author Nature paper in 1961. And many of us know, or all of us should know, that X inactivation is a mechanism of dosage compensation in females, and this is an epigenetic mechanism.
And when I Googled her and looked at her Wikipedia page, I discovered that she had what I would consider to be the perfect death. She was 89, and on Christmas Day she had a glass of sherry, ate a sandwich, and then fell asleep in her chair. I think that's a pretty good way to die, if you ask me. All right, that's my other joke. Okay, so moving on. So now we understand that DNA methylation is in the genome, we understand the concept of euchromatic and heterochromatic regions and their dynamic borders, and we understand that these discrete regulatory elements exist in the genome. In the next few years, people started to understand that DNA methylation itself regulates gene expression, and it started to be postulated to play a role in differentiation: this concept of stepwise differentiation from a totipotent cell, from a fertilized egg through to a terminally differentiated cell. There are a number of papers by researchers at the time who postulated, with no direct evidence yet, that DNA methylation provided this mechanism of stepwise differentiation. Around that time, in the field of cancer research, there were a number of efforts looking for cytotoxic agents, and one of the agents that was discovered was 5-azacytidine. 5-azacytidine ended up being extremely toxic at high doses, but people such as Peter Jones and Stephen Baylin and many others started to use it at a very low dose, and what they found when they did this in culture is that it actually drove differentiation. So now we have direct evidence that altering DNA methylation drives differentiation, and this connection between differentiation and epigenetic mechanisms came to the fore. So that's a very, very brief overview of the history of DNA methylation. Moving up to today: what do we understand about DNA methylation? There's an excellent review by Dirk Schübeler from 2015, if you're interested; it was published in Nature, and Dirk has made some seminal findings in the area of DNA methylation. Of course, the dogmatic view of DNA methylation is that it's associated with repression: if we see DNA methylation in a promoter, that gene is repressed. But that's a very context-specific understanding, and as you work in your respective fields, realize that our understanding is continuing to evolve. We now know, for example, that DNA methylation is also associated with active transcription; we know that DNA methylation is present within gene bodies. So if you're doing an epidemiological study and you find DNA methylation associated with some phenotype, remember that it matters where that DNA methylation is. If it's in the gene body, it's probably associated with active transcription, so don't get stuck in that dogmatic view of epigenetics. But here are some basic concepts around DNA methylation. You all know that CpGs are unevenly distributed in the mammalian genome, right? And why is that? I'm going to ask you questions; you're all epigeneticists. What is the AT/GC content of the human genome, or mammalian genomes? Is it 50-50? Not you, Alvin, I'm kidding. Is it 50-50? Is it AT or GC rich? So it's a little AT rich, right? Does anybody know why? Sorry? It's not GC rich; no, it's AT rich.
So CpGs have been specifically depleted in the human genome, and why is that? Is it because methylated cytosines have a tendency to mutate into thymines? That's right. Methylated cytosines spontaneously deaminate into thymine, which is 5-methyluracil, right? And those are not recognized by the base excision repair mechanisms, and so over the eons they have slowly mutated away. However, there are regions of the genome where this has not occurred. What do we call those regions, the regions where the ancestral CpG content of the genome is still there? Sorry? CpG islands, that's right. So just remember, when we say, oh, these are CpG-rich regions: they're not really CpG rich. They're just what we expect, what the ancestral genome was. And the reason these things most likely still exist is that they're under selective pressure. The spontaneous deamination events are occurring, but they're being repaired, because these regions are regulatory, or at least that's our notion. So: uneven distribution; CpG islands are remnants of the ancestral genome; and they're there, we believe, because of their regulatory content. We see them in regions such as promoters, and in promoters, the majority in a normal somatic tissue are unmethylated. In gene bodies, they're methylated. Those are some basic concepts around CpG islands, and of course, there's a whole field studying how these become deregulated in the context of human diseases, including cancer. Now, if we look genome-wide at all CpGs, and there are about 28 million of them in the genome, the default state is methylated: somewhere between 70 and 80% of CpGs in the human genome are methylated, and many people consider that to be the default state; the unmethylated state is actually the information content. As I've already said, transcribed gene bodies tend to be methylated. And work out of the International Human Epigenome Consortium, including the Roadmap project and ENCODE and many others, and including Dirk's work, has shown that active enhancers tend to be hypomethylated. So transcribed gene bodies are methylated, active enhancers tend to be hypomethylated or unmethylated, and people are now using this as a way of defining regulatory states. And as I said, CpG islands tend to be unmethylated. OK, but let's keep in mind that DNA methylation is not essential. Not essential. It's true that vertebrates have DNA methylation, both within gene bodies and within transposable elements, and we are vertebrates, we study humans, and of course mouse is a vertebrate. But other model organisms have lost DNA methylation. We think it was present ancestrally; we know that DNA methylation, for example, is present in prokaryotes, used as an immune system, as it were. But it has been lost in some lineages: C. elegans has lost it, fruit flies have lost it. So we know it's not absolutely essential. And in fact, if we look at embryonic stem cells, we can knock out all the DNA methylation machinery, DNMT1, DNMT3A, DNMT3B, and they're fine, they're fine in culture. However, as soon as we differentiate them, they die. So we know that DNA methylation is not essential for life, even in a vertebrate cell, even in a mammalian cell, but it does seem to be essential for differentiation. And the other point I think is important to keep in mind is that DNA methylation is reprogrammed.
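Going back to the CpG depletion point for a moment: the standard way to quantify it is the observed/expected CpG ratio, in the style of the classic CpG island criteria. Here's a minimal sketch of my own (not from the talk); bulk mammalian sequence is heavily depleted, while islands sit near the ancestral expectation.

```python
def cpg_observed_expected(seq: str) -> float:
    """Observed/expected CpG ratio for a sequence window.

    Under independence, the expected CpG count is (#C * #G) / length,
    so the ratio is (#CpG * length) / (#C * #G). Bulk mammalian DNA
    is depleted (roughly 0.2); CpG islands approach 1.
    """
    seq = seq.upper()
    n_c, n_g, n_cpg = seq.count("C"), seq.count("G"), seq.count("CG")
    if n_c == 0 or n_g == 0:
        return 0.0
    return n_cpg * len(seq) / (n_c * n_g)

# A depleted stretch versus an island-like stretch (toy sequences):
print(cpg_observed_expected("ATTTGCATGACCATTGA"))   # low ratio
print(cpg_observed_expected("GCGCGGCGCGCCGCGAAT"))  # much higher ratio
```

In practice you would slide this over windows of a few hundred base pairs, together with a GC-content threshold, to call island-like regions.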
So, reprogramming: DNA methylation is reprogrammed twice in mammalian organisms, once during germ cell development, when we go from primordial germ cells to mature germ cells, either sperm or egg, we get a DNA demethylation and then a remethylation event, and then again just after the zygote forms. So remember that we have this reprogramming event, and of course this is relevant for those of us who are trying to understand epigenetic inheritance and its context. I would say that emerging research is suggesting that at least part of how that might be happening is actually through histone modifications, not through DNA methylation, and there is some evidence of that in model organisms. OK, so that's DNA methylation. Of course, we know that epigenetic information is also encoded within the proteins that package the DNA, the so-called nucleosomes. And nucleosomes have been known for many years, now going back to 1974; Roger Kornberg and many others did some of the seminal work defining the nucleosome structure. Then in 1997 came the seminal paper showing the three-dimensional structure of the nucleosome with the DNA wrapped around it. I wanted to show this because we're all familiar with the fact that histones, H3 for example, have these unstructured N-terminal tails sticking out of the nucleosome, and it's these unstructured N-terminal tails that are highly decorated with chemical modifications such as methylation, acetylation, phosphorylation, ubiquitination, et cetera. This is where the so-called histone code, as David Allis coined the term, is laid down. So we know that chemical modifications to histone proteins influence gene expression, and this is just a series of papers with the first descriptions in various organisms. This was first shown conclusively in Tetrahymena by David Allis, with acetylation. But we now know there's actually a complex histone code, as I'll talk about. And part of what we do in epigenetics is try to study this, try to measure this, and ChIP sequencing, which is what we're going to be talking about today, is one method to do that. So this is the seminal work; again, Tetrahymena is a beautiful organism, and this is the work of David Allis showing the relationship between histone acetyltransferases, acetylation of nucleosomes, and active transcription. So acetylation of histones, active transcription: that was really the first description in the context of gene regulation. We have now described at least 100 different histone modifications, a subset of which we've studied very intensively, and the list of them is here. These are the marks most often studied in the context of epigenomic research, including some of the research that I do, and I'm sure many of you in this room are familiar with them. They're associated with both active transcriptional and regulatory states, as well as repressed or heterochromatic states. Just to remind everybody how the nomenclature works, and hopefully everybody's familiar with this: we describe the histone first, then the amino acid that's modified, in this case lysine 4, and then the modification, in this case trimethylation. So histone H3, lysine 4, trimethylation. We know that H3K4 trimethylation is associated with active promoters.
But we know that, for example, H3K4 monomethylation is associated with enhancers, and so on; these modifications have been associated with various regulatory contexts. But of course, there are many, many more. And one question I get asked all the time is, why do you study those? Why are these modifications studied more than others? For that, we have to go back to work by Manolis Kellis and Jason Ernst and many others, as part of the ENCODE Consortium and also the NIH Roadmap Consortium, when we initially started working in this area; Bing Ren, Brad Bernstein, and others generated some of the first data sets here. When we actually looked across a wide variety, and here we've got, I don't know, probably 30 or 40 different histone modifications, what we saw is that many of these modifications appeared to be functionally redundant. In other words, if we knew what one was, we knew what all the others were. And so those modifications were selected, rightly or wrongly, based on the apparent redundancy of the histone code. The thought is that by studying that small set of histone modifications, we're actually describing a much richer histone modification landscape. You might ask, well, why are these other modifications there? Do they actually have regulatory potential? I think there is emerging research showing this. And those of you early in your careers: there's lots of opportunity to study what these other modifications are doing. Even though they're highly correlated with the modifications currently studied intensively in the literature, many of them may have unique and very interesting functions. OK. Yes, question? [Question about whether these modifications were ever characterized mechanistically, or in other cell types.] So they were only characterized computationally; they were never mechanistically characterized. And most of the initial studies were done in human embryonic stem cells, in particular a line called H1, the line from James Thomson, one of the first human embryonic stem cell lines ever generated. That's where most of the initial data was generated. And if you look at the Roadmap, you'll actually find, I think, about 50 modifications that Bing Ren's group did at UCSD at the time. So if you're interested, it's there. But you're right: there are, what, 200, 300 different somatic cell types in the human body. Maybe these modifications are playing a role there. It seems unlikely, why would they be there if they're redundant, although functional redundancy is a common theme within the genome; we know many times we can knock out one pathway and another pathway can functionally compensate. So maybe that's true in the histone landscape as well. So, I always like to finish my brief history of epigenetics by thinking about some of the major questions in the field. I would say two of the major questions at the moment are, first, cause versus consequence. Keep in mind that almost all the computational biology that we do, at least in the arena of ChIP sequencing, is correlative, right? We're correlating the presence of a mark with some other thing: gene expression, a genomic feature. It's all correlative.
We really don't know cause versus consequence. Is that mark there because RNA polymerase is there? Or is that mark there to recruit RNA polymerase? So the whole concept of cause versus consequence is still a very open question in the literature; keep that in mind as you work with the data. And the second major question, I would argue, is really the scope. How much transgenerational inheritance actually is there, and what are the mechanisms by which it happens? I already talked about the fact that DNA methylation is reprogrammed, and of course sperm itself is a unique epigenetic landscape: most of the histones are lost, and it's packed with protamines. So how are these signatures actually being transmitted transgenerationally? Epidemiological data suggest this is happening, but the mechanisms remain to be determined, and I think that's a very important area in the field. OK, that's about the first 20 minutes, I guess. So now I'm going to move into how we measure epigenomic features, and in particular I'm going to focus on ChIP sequencing; Guillaume is going to focus on whole genome bisulfite sequencing. If there are other epigenetic assays that folks would like to discuss, we're here, and I'm happy to discuss them; in the interest of time, I can't really go through all of them, so I've decided to focus on histone modifications. But we're here for the next couple of days, and I'm certainly happy to talk with anybody about your individual assay. So, as I've already mentioned, the field of epigenomics and epigenetic measurement was really brought to life by massively parallel sequencing. I guess many of you in this room were probably in elementary school when massively parallel sequencing was first developed, but it was revolutionary for those of us who were in the field. In fact, the first assays ever run and published on the Solexa platform, at the time, were ChIP sequencing assays. This is a summary that we published along with the IHEC companion papers that came out in November this year in Cell, and it highlights four different assays, including ChIP sequencing, all of which depend on massively parallel sequencing. For those of you who work in the wet lab as well as the dry lab, there's lots of scope for generating little DNA fragments that we can sequence in various different cool ways to measure different epigenetic features: ChIP sequencing is just one of them, and we can also measure things like methylation and three-dimensional structure, as well as open chromatin using assays such as ATAC-seq and many others. Again, I'm happy to discuss the details of those. OK, so getting into ChIP sequencing; hopefully most or all of you are familiar with this. This is a slide I actually made back in 2007 or 2008, but it's still relevant today. Let's remember that the genome is packaged in these nucleosomes. They're not in an ordered array like this, although there is some ordering around transcriptional start sites. We start ChIP sequencing by fragmenting the chromatin, and this is done in two primary ways. We either cross-link the proteins to the DNA and then shear using sonication, keeping in mind that sonication breaks one strand at a time, so we clip one strand or the other.
And you have to have two of those breaks together in close proximity to liberate a fragment, which has implications as you get into the analysis. Or we use something called MNase, micrococcal nuclease, which can digest between nucleosomes. There are advantages and disadvantages to each of those technologies. Let me just say that for those of you interested in studying very small numbers of cells, on the order of 10,000 to 100,000 cells, MNase-digested chromatin, native ChIP sequencing, is the way to go, because then you're not dealing with the DNA damage associated with cross-linking. We and many others have published papers that enable that, and in fact, as I think some of you are aware, Brad Bernstein's group has even gone down to single cells; I'm happy to discuss that as we go through. But two basic ways: digestion and shearing. And there are combinations, so some people have done cross-linking and shearing, and there are advantages and disadvantages to each in terms of the downstream applications. Once we have these fragments liberated, still associated with nucleosomes, either in the native state with MNase or in the cross-linked state, we can then use antibodies to enrich. So we're essentially just enriching for, pulling out, these DNA fragments from a population. That's all we're doing. We take that enriched population, shotgun sequence it, and align it back to the genome, as we'll discuss in detail in module 2. So what are the key considerations for doing ChIP sequencing? Number one is antibody specificity and sensitivity. Currently, we spend more time and money validating antibodies than we do buying them, OK? Most of the antibodies we bring into the lab fail our validation, and I'm, again, happy to discuss that. I direct something called the CEEHRC, the Canadian Epigenetics, Environment and Health Research Consortium; if you go to the epigenomes.ca website, you can find our antibody validation page and see the details, and I'm going to go through a little of that. So be very wary of the antibody you're using: if you're using the wrong antibody, with the wrong specificity, the data you get is obviously going to be impacted. The second key consideration, which marks should I profile, depends on the question. If you're looking at enhancers, maybe you want K27 acetylation and K4 monomethylation; if you're looking at Polycomb and Trithorax, K4 trimethylation and K27 trimethylation. So that's a consideration. What about sequencing depth? This is another question I get asked all the time. These are the current recommendations of the assay standards group within the International Human Epigenome Consortium, for human, OK? This is much deeper than it appears one needs to go for mouse; mouse is about half of this. There are a number of published attempts to look at saturation of histone marks as we go deeper and deeper, and for reasons we still don't fully understand, the human genome seems to behave slightly differently, or has different characteristics, than the mouse. In fact, even when we go as deep as 100 million read pairs for a heterochromatic mark like H3K9 trimethylation, we still seem to be capturing new regions of the genome. And those of you who have worked in transcriptional space, for example RNA sequencing, know that's also true there.
The deeper you sequence, the more complexity actually emerges. Whether the complexity at the very end of the distribution is biologically relevant is, I think, an open question, but one to consider when you're planning your experiment. So, to summarize on antibodies: garbage in, garbage out. Whatever you're enriching in your fragments, if you're not using an antibody with specificity, what you get out the back end is going to be uninterpretable. And there are many issues related to antibodies: not only can you have cross-reactivity, you can also have antibodies enriching for other modifications. So again, be very wary of the antibody you're using, and do perform your own QC analysis of it before you start your experiments. Yes? [Question about whether the recommended depths are in read pairs, and what depth to use with a well-validated antibody.] OK, so what's the difference between read pairs and fragments? This is a good point. Read pairs and fragments are different things, right? Read pairs are the pairs; fragments, of course, are how many actual pieces of DNA you sequence. And so it's half: if you were to do single-end sequencing, it would be 25 million, because your 50 million paired reads are sampling 25 million fragments. However, I would argue that if you're going to do ChIP sequencing, you should use paired-end reads, because they allow you to define the fragment size. When ChIP sequencing was first developed, we and others developed a computational extension, where we would take from the lab the actual fragment-size distribution loaded onto the sequencer and computationally extend each single-end read to the average fragment length (there's a little sketch of the idea below). That has some issues, and I think Misha might talk about some of them in the alignment module. But if you've got paired-end reads, you can define the ends of the fragment, and that gives you a lot more resolution. I would argue single-end sequencing is probably not going to be around much longer; I think most production facilities have probably stopped generating single-end data, so it's a good time to switch to paired-end. But yeah, half as many. Another question? [Question: what read length should one use for paired-end or single-end?] So, we know that at 75 base pairs we're mapping most reads uniquely, at least in the human genome, and we'll stick to the human genome. But really, this is now more dictated by the vendor. This is more of an Illumina issue; most of us use Illumina sequencing, it's pretty much the dominant platform, and their reagents are kitted as paired-end 150s or whatever. The flexibility in read length really isn't there much anymore, at least in my experience, and I think that's going to become more and more the case. Typically we're running paired-end 125s now, just because that's how the reagents are kitted, but paired-end 75s would be more than sufficient for most applications. However, if you're interested in such things as transposable elements, you'd want a bit more; paired-end 125s will be useful. And that's another feature of paired-end reads when you're looking at repetitive sequences: you can anchor one read in the unique part of the genome, and the other read can drop into the transposon.
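On that computational extension of single-end reads: here's the idea in its simplest form. This is an illustrative sketch of my own, not the original implementation; real tools typically estimate the mean fragment length from the data (for example by cross-correlation) rather than taking it from the lab.

```python
def extend_to_fragment(start: int, end: int, strand: str, frag_len: int):
    """Extend a single-end alignment (0-based, half-open coordinates)
    to an assumed mean fragment length, in the read's 3' direction.

    Plus-strand reads mark the fragment's left edge; minus-strand reads
    mark its right edge. Paired-end data makes this guesswork unnecessary,
    since both fragment ends are observed directly.
    """
    if strand == "+":
        return start, start + frag_len
    return max(0, end - frag_len), end

# A plus-strand 75 bp read at position 1,000 standing in for a ~200 bp fragment:
print(extend_to_fragment(1000, 1075, "+", 200))  # (1000, 1200)
```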
With pairs anchored like that, you can actually distinguish between different repetitive elements. So again, I think read length is largely a vendor issue. You had a question? [Question: the methylation pattern can change between cells, even within the same cell type; in monocytes, say, some cells carry a mark and some don't. Would there be an advantage to using single-cell sequencing technologies, so you can distinguish specific cells?] True, yeah. So, OK. First of all, I think our concept of what a cell type is is evolving, in part because of what we've learned from epigenetics. This idea of, here's a cell type that we can put in this box, and here's a cell type that we can put in that box, has largely gone away. We now know from single-cell studies, as well as functional studies, when we actually take those cells and, for example, inject them back into a mouse, that there's a functional distribution, a distribution of cells that exist across these states. So certainly, single-cell studies are an important contribution to the field. In our lab, we do single-cell DNA methylation. It's very expensive and very time-consuming at the moment, and we're looking for ways to make it more effective. Single-cell RNA sequencing, of course, is another technology that's out there, but keep in mind that you're only sampling a very small fraction of the transcriptome of any single cell, and in fact, if you go down and look at what you're getting, most of the key players aren't there, because we're not detecting things like transcription factors that are expressed at very low levels. So yes, single-cell studies certainly have a place; it's an emerging area of research, and one we've been pursuing in the context of hematopoietic progenitor populations, which are of great interest for understanding heterogeneity within cell populations. We know that even though we can sort them to very high purities, there are still functional differences between them, and when we look epigenetically, they're different. I should say that our work, and the work of Christoph Bock, looking at single-cell epigenetics, primarily DNA methylation, shows that even if we look at two highly purified cells, about 12% of CpGs are in the opposite state. So there is a lot of heterogeneity going on, and I don't think we fully understand why that is, or how it's being collapsed into the profiles we see. But that's an important point to think about when you're doing your research: when we generate this data, we're actually taking a consensus, a consensus of the population. DNA methylation, for example, is binary at any single allele, so the concept of 35% methylation at a position is just a reflection of the heterogeneity. And when you talk about a 5% change in, say, your ChIP-seq or histone modification amplitude, or in your DNA methylation, what you're really saying is that you've got a change in cellular composition, because DNA methylation is binary. So keep that in mind: you're looking at a consensus of a population, not an individual event. OK, any other questions? If not, I'll move on. So this is the validation that we do; this is published. We validate both the catalog number and the lot. This is a particular antibody that made it through the pipeline, but many antibodies from Diagenode do not.
And again, I'm happy to discuss that. We do both a 384-peptide array, this is the Agilent peptide array, to look at affinity, as well as a Western blot. And then for every new catalog number that we get, we do ChIP sequencing and correlate it back to known data; we have a control data set, and there are many control data sets out there. I would highly recommend you do that before you embark on a big study with an antibody. So validate your antibody; again, this information is publicly available. ENCODE has published, again, a review of this area and their suggested QC measures, which are in line with what we're seeing here. So this antibody here is against K4 trimethylation, which turns out to be probably one of the most well-behaved antibodies. Each one of these is a different peptide, so you can see the highest affinity is for K4, but we also see cross-reactivity. This antibody, against K27 acetylation, is one of the worst behaved: it's very difficult to get a highly specific antibody against H3K27 acetylation, and these are the off-target effects shown here. So if you're working with H3K27 acetylation, please be very careful with the antibody you choose, and ensure that you've validated it appropriately. Garbage in, garbage out. Okay. [Question: sorry, would this one pass the validation?] This one did, yeah. But this is as good as it gets. I mean, we've had a commercial antibody where the first peptide it recognized was a different acetylation mark entirely. And this is common, unfortunately. [Question: do you see a lot of variation from lot to lot?] Yes, yes. Even with the same catalog number, a different lot, which you'd think would be the same bleed if it's a polyclonal, we see different affinities, suggesting that it's actually not the same bleed. And there are these so-called ChIP-seq grade, "gold standard" antibodies; my experience has been that it's an unsolved problem in the literature. There are monoclonal antibodies becoming available, so CST has a monoclonal now for H3K4 trimethylation that I think is very good quality and has been around for many years. Most of the K27 acetylation antibodies, for example, are still polyclonal, although there are some monoclonals starting to emerge. If you can get a monoclonal, obviously that's the way to go, because then you have a secure source for replacement. But in some cases, monoclonals just don't behave in ChIP-seq experiments. Yes? [Question about the cut-off used in the Western blot validation.] Yeah, so that's a good question. The cut-off, and this is just totally arbitrary, is 50%: we want to see that the target band is 50% or greater. And it's interesting, we actually see differences between cell types too, for reasons we don't understand, but that's the current criterion. Not perfect, but it's better than not doing it. You'll find that with some antibodies you just get a smear, so you know you're pulling down a bunch of stuff; maybe the chromatin's in there too, but you're pulling down a bunch of other stuff as well. So I think it's really important for you to do this before you start your studies. Yes? [Question about published data sets where the antibody details aren't given.] Yeah, so first of all, you should contact the author, because this is now mandated by the journals, and for people who work within IHEC, it's mandated within the metadata: both the catalog and the lot number are provided, as well as how much antibody was used, which of course is another important parameter in ChIP sequencing.
I'm not gonna go through all the molecular biology. So, can you trust a public data set? It depends, right? If that information is not provided, I would be wary, because I would ask, well, why aren't they providing it? Anyone who works in the field, if you publish a paper, you would want to include that, so that someone could replicate your study. So I would ask the author if it's not provided, and I think that's the best you can do. But you can also correlate with other data and see whether it's representative; those kinds of checks can be helpful, yeah. Okay, all right, so moving on. This is a very high-level overview, a ChIP sequencing processing flow. We start with sequencing, and I'm going to talk about the Illumina platform, the predominant ChIP-seq platform at the moment, although there are perhaps other platforms coming online. In module one, which I'm talking about now, I'm really just going to take you to the FASTQs; it sounds like most of you already know what a FASTQ is, so when I ask you questions about it, you're going to know the answers. But I'll go through it real quick. Then in module two, we'll go from FASTQs to BAMs, which of course are the binary alignment files: the output after alignment, which then gets further processed into things like WIG files and other representations of the data. That's what Misha's going to talk about. And then David will be talking about integration, I guess in module four later in the course, so integrating it all together. Okay, so, to remind everybody: sequencing underpins what we do here, and understanding it is important. Let's just remember the basic workflow. We can start with any nucleic acid, in this case DNA, which, in this case, we've fragmented using MNase or sonication. We get a pool of random fragments, which we then end-repair and ligate. If we sonicate, we've got these long ragged ends that need to be filled in or chewed back; if it's MNase, it's more of a blunt end. We ligate adapters, and then, for ChIP sequencing, almost exclusively, we PCR amplify. At that PCR amplification step we add our barcode, which is very typical now, so that we can multiplex within a flow cell (there's a little sketch of the matching logic below). Everybody's familiar with barcoding, right? Everybody understands that when you're working with a sequencing facility, your FASTQ file will typically already be split, split on the index: you'll get one FASTQ file, or two FASTQ files if it's a paired-end run, already demultiplexed by index. This happens behind the scenes, you don't see it, but keep in mind that what's coming off the sequencer is actually three files: read one; read two, which is actually your index read; and read three, which is your second mate. And I'll talk about how those associations are made, just so you understand. Okay. So, Illumina sequencing does clonal amplification on a solid surface, on a flow cell, a solid piece of glass. There are other ways of clonally amplifying DNA molecules; clonal amplification is the key characteristic of NGS sequencing. What's the other predominant way in NGS to clonally amplify these fragments?
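On the barcoding point: demultiplexing normally happens at the facility (Illumina's own tools handle it), but the matching logic is simple enough to sketch. The barcode table below is hypothetical, and this only shows the core step of assigning an index read to a sample while tolerating a sequencing error.

```python
# Hypothetical sample sheet: index sequence -> sample name.
BARCODES = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(index_read: str, max_mismatches: int = 1):
    """Match an index read to a sample, tolerating sequencing errors.

    Returns None if no barcode (or more than one) matches within the
    mismatch budget, i.e. the read is unassignable or ambiguous.
    """
    hits = [sample for bc, sample in BARCODES.items()
            if hamming(index_read, bc) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None

print(assign_sample("ACGTAC"))  # sample_A (exact match)
print(assign_sample("ACGTAG"))  # sample_A (one mismatch tolerated)
```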
So, back to the question: what's the other way that we do this? Anybody? So typically it's done in solution. But you're right that the sequencing itself is done in microwells, for example in the 454 or the Ion Torrent system, which was Life Technologies and now is, I guess, Thermo Fisher, right? But how is that clonal amplification done, if not on a flow cell surface? SOLiD sequencing, remember that? Same idea: it's done on a bead. And it's done using something called emulsion PCR, where we take our PCR brew, our beads, and a lot of oil, and we partition each of the beads and our DNA using an emulsion. The problem with that, of course, is that we're relying on a Poisson distribution to get a single bead and a single piece of DNA together, and we get a distribution: many compartments have two, three, four, and many are empty. That's why that technology was really difficult to apply en masse; you needed a lot more material to move forward, and it's one of the reasons why, arguably, Illumina sequencing has come to predominate in the market, or at least in the research realm. Okay. So we ligate the adapters, I've already talked about that, and as you know, we then make the material single-stranded: it's double-stranded, we denature it, and we flow it over the flow cell surface. These little colored things, which are a little faded here, in light blue and purple, are the oligos grafted on the flow cell surface, and you get random hybridization between your fragment ends and those grafted molecules. So we get this grafting occurring, and then through random diffusion, the molecule flops over and finds the partner oligo for its other end. So now we've got a bridge: this is five prime, that is three prime, and that makes a priming event, and then we can flow in our polymerase and nucleotides and do an extension. Now we've made a clonal copy of that DNA molecule at a particular location on the flow cell, okay? We repeat that process about 30 times using isothermal amplification, so we essentially denature, do it again, denature, do it again, and we get these clusters. These clusters exist in discrete locations on the flow cell surface, with about a thousand molecules per cluster. So that's how we generate the clusters; they're made single-stranded, and then we can start the process of sequencing. Zooming in: we've got the three prime end of our fragment, we add our primer, and then we add our nucleotides; well, I'll talk about that in a second. We add our nucleotides, and this should all be review for everybody, and a base is incorporated by sequencing by synthesis. The other technological advance of Illumina, or Solexa at the time, was the concept of reversible terminators. These nucleotides are terminated, but not like Sanger sequencing, where dideoxynucleotides are permanently terminated; they're reversibly terminated. So they only incorporate one at a time, whereas other technologies, such as 454 and Ion Torrent, do not use terminated nucleotides: if there's a homopolymeric stretch, you're going to get two, three, four, five, six nucleotides all incorporated at once.
And the reason those other technologies have high indel error rates is that it's very difficult to estimate, once you get past three or four nucleotides, how many have been incorporated. So the other major advance of this sequencing technology is reversible terminators: you incorporate one, you wash away, the so-called flush and scan, we illuminate with a laser, detect the signal, and then start the process again and again. So we're reading out each cluster one nucleotide at a time. So how do we actually call the sequences? We start with the first images and generate what's called the focal map. Okay, the focal map is important, and it becomes important later on, which is why I'm telling you this. The focal map essentially works on what we call tiles: we take a particular image of the flow cell and divide it into X and Y coordinates. So we have a lane of a flow cell, a tile of that lane, and then an X and Y coordinate. And the focal map, which is generated over the first 25 base pairs, associates a unique position on the flow cell with each cluster, and that becomes very important for all the downstream analysis. Once we've generated that focal map, we then, as I think everybody knows, do one scan at a time and build up the sequence from that set of images. In the early days of Solexa sequencing, we used to keep the images and push them off onto a server; this was back in 2008, 2009. It was actually a graduate student who came up with a methodology for generating the sequence calls on the fly, and then we stopped storing the images, but I won't bore you with those details. Essentially, we build up the sequence that way. Now, there's one consideration to keep in mind if you're working in the wet lab generating a library: to generate a focal map, the nucleotide representation of your library needs to be balanced, okay? If the first position is the same across your whole library, it's going to overwhelm the camera. Say, for example, you added some adapter where you knew the sequence, but it's identical for all the molecules in your lane: the sequencer is then going to be overwhelmed, because it won't be able to tell apart the discrete positions of the individual clusters. So it's important that your first nucleotides are balanced, to allow it to detect the individual features. This becomes important, for example, even in whole genome bisulfite data, because that, of course, has reduced base complexity. Okay. And of course we have paired-end chemistry as well; just keep in mind that paired-end chemistry essentially regenerates the cluster on the flow cell surface after we've done read one. So read one is here: we make it single-stranded, do the sequencing, then we essentially regenerate the cluster and cleave the other end for read two. The oligos grafted to the flow cell surface are chemically modified, and one of them is cleaved for the first read, the other for the second read.
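To make that balance point checkable: here's a minimal sketch of my own (assuming a plain or gzipped FASTQ) that tallies base composition over the first cycles, which is essentially what FastQC's per-base content plot shows you.

```python
import collections
import gzip
import itertools

def per_cycle_base_counts(fastq_path: str, n_cycles: int = 10,
                          max_reads: int = 100_000):
    """Count base composition at each of the first n_cycles positions.

    A healthy, diverse library shows roughly stable proportions across
    cycles; if one base dominates an early cycle, cluster detection on
    the instrument can suffer.
    """
    counts = [collections.Counter() for _ in range(n_cycles)]
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as fh:
        # FASTQ records are 4 lines each; the sequence is the 2nd line.
        for n, seq in enumerate(itertools.islice(fh, 1, None, 4)):
            if n >= max_reads:
                break
            for cycle, base in enumerate(seq.strip()[:n_cycles]):
                counts[cycle][base] += 1
    return counts
```

You would then flag any early cycle where a single base makes up more than, say, half of the calls.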
Okay, so very briefly, this is what happens behind the scenes: lots of images are generated, and these are turned into intensity files. Up until a few years ago, this was output as something called a Qseq file, but now I think most of us just deal with the FASTQ file, so the Qseq file is an intermediate. There is one metric I think is worth discussing, the so-called chastity filtering. Are you familiar with this? It's something that's flagged within the FASTQ file; has anybody heard of it before? Okay. So chastity is a way of flagging polyclonal clusters. You can imagine that because you're randomly grafting these fragments, or even now that we have ordered arrays on the Illumina instruments, you can get cases where clusters become mixed. In other words, you have more than one fragment associated with a particular position on the flow cell: instead of capturing one strand of DNA, you captured two. And so we want to identify those and remove them from downstream analysis. The flag is called chastity, and this is essentially the algorithm used to call it; it's very simple. It's the brightest intensity, whatever nucleotide gives you the brightest intensity, divided by the brightest intensity plus the second brightest intensity. That has to be greater than 0.6, a fairly arbitrary threshold, over the first 25 base pairs, and one cycle is allowed to be a failure; again, that's just been determined empirically. This, as I just mentioned, flags polyclonal clusters, and you'll see the chastity filter encoded within the read name of your FASTQ file (there's a little sketch of the formula below). [Question: is that applicable to other Illumina platforms as well?] It's applicable to all the sequencing, yeah. And if it's a good sequencing run, you should have very few polyclonal clusters, very few chastity-failed reads, but it's a good check: you can grep through your file pretty quickly and see how many chastity-failed reads you have. That's always a good thing to do. Okay. So, like Sanger sequencing, when Illumina or NGS sequencing was first developed, we had no measure of quality. (I don't know why this thing is acting up; Phil's picture got a bit big there, but anyway.) Initially, when we were doing NGS sequencing, we had no quality measures, and the only way we could measure quality was to align the reads back to a reference genome. From there, we could say, well, we've got this many mismatches compared to the reference, so it's such and such quality. Over time, we adopted the so-called Phred score, which is essentially a log-transformed error probability, and I've given you the description of how it's calculated for Sanger just to give you an idea: things like peak spacing, how much background signal, et cetera. Massively parallel, that is Illumina, sequencing uses features such as intensity, separation from the next cluster, and something called phasing, which I'm not going to talk about, but if you're interested I can give you the details of how phasing is determined. These are used to assign a so-called base quality score to each of the bases in Illumina sequencing. So that's what a base quality is: a log-transformed error probability, and the formula is shown there. [Question: but what if a mismatch against the reference isn't actually an error?] Yes, right, so that's the problem.
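Going back to the chastity filter: you'll never compute this yourself, the instrument does it, but the formula as described is easy to write down. A minimal sketch, assuming you had the four per-channel intensities for each cycle:

```python
def chastity(intensities) -> float:
    """Brightest intensity over the sum of the two brightest."""
    first, second = sorted(intensities, reverse=True)[:2]
    total = first + second
    return first / total if total > 0 else 0.0

def passes_chastity_filter(per_cycle_intensities, threshold: float = 0.6,
                           window: int = 25, allowed_failures: int = 1) -> bool:
    """Illumina-style pass-filter: at most one of the first 25 cycles
    may fall below the 0.6 chastity threshold."""
    failures = sum(1 for cycle in per_cycle_intensities[:window]
                   if chastity(cycle) < threshold)
    return failures <= allowed_failures

# A clean cluster versus a polyclonal one, at a single cycle:
print(chastity([900, 50, 30, 20]))   # ~0.95, passes
print(chastity([500, 450, 30, 20]))  # ~0.53, fails this cycle
```

The intuition: a monoclonal cluster lights up one channel far more brightly than the rest, while a mixed cluster splits its signal between two bases.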
Not just mutations, but polymorphisms, exactly. So of course, when we were sequencing human genomes, doing ChIP sequencing and aligning back to the human genome, we would see mismatches and count them as errors, but many of those are polymorphisms: there are about four million positions different between your genome and the reference, between my genome and the reference. That was the problem with that methodology. So it took some time to generate this table, which is empirically determined; it was generated essentially by resequencing, the same way it was done for Sanger, an artificial or synthetic piece of DNA over and over and over again, and recording the characteristics of the bases that were as expected and those that were not, to build the lookup table that is now used by the algorithm. I'm not going to go into the details of the lookup table; the point is that this is the quality metric. So, base qualities. When Misha presents the next module, we'll talk about mapping qualities, which are different things: base quality and mapping quality are different, and base quality refers to the quality of the base call. Okay, so the output of an Illumina sequencer looks something like this. You'll never see this file, but this is what goes into generating the FASTQ file. I bring it up because the details up here, which are concatenated into the name of the sequence in the FASTQ file, are what identify that sequence read, okay? What we have are such things as which instrument it was run on, which run it is, occasionally which lane, and then, importantly, the tile (why is this not working? anyway), so the tile, and then the X and Y coordinates. Putting all of these together, concatenating them, gives you a unique ID for that read. And when we generate paired-end reads, or when we associate the index with a read, that is the key used to identify the read: read one is generated as one file, read two is generated as another file, and the only way we know read one and read two are associated with each other is by their X and Y coordinates. That's the only way we know; that's how it's encoded. Of course, that name should be unique within any given run. Now, there are cases where that's not so. For example, if you do multiple runs on a sequencer, and say you're not using the instrument name or the run number, then just by random chance you can get a conflict: two independent reads with the same tile and X and Y coordinates. So the concept of read groups was developed, which I think perhaps Misha will talk about; you might have heard of read groups. Read groups are a way of saying this chunk of reads is one group with this unique identifier, versus another group, and then you can merge those in the downstream BAM. Okay, so the sequence, of course, is there, and then the quality scores are there: these are base quality scores. Does anyone know what these are? I told you they're Phred scores, log-transformed error probabilities, right? So why are they letters? Does anybody know how that's encoded? [Audience: ASCII.] Right, ASCII encoded. So they're ASCII encoded, and why do we encode them with ASCII?
Because it takes less space, exactly; very simple, right? Phred scores are, for example, Phred 10 is one in 10, Phred 20 is one in 100, Phred 30 is one in 1,000, Phred 40 is one in 10,000. Instead of encoding two digits, 10, 20, 30, 40, we encode one character; it's just a way of saving space. The thing you need to know with ASCII, of course, is the offset used, so that you can convert it back into a quality score. Now, confusingly, up until probably three or four years ago, I don't remember exactly when the flip happened, Illumina used to store their base qualities at an offset of 64. So in some cases, although those files are probably pretty hard to find these days, it might be Phred+64; that's why you need to know which version of the base caller was used, and that's one of the things about the FASTQ we're going to talk about. But now everything is encoded at an offset of 33, and 33 is used because ASCII 32 is just the blank space. So that gets us to the FASTQ. FASTQ, as everybody knows, is the universal standard for encoding bases and base qualities; it was developed by the Sanger Centre. Before FASTQ we had FASTA files, which just have strings of DNA, nucleic acid strings, with no qualities associated; the FASTQ file is a way of storing both the sequence and the quality, with quality scores encoded at offset 33, so keep that in mind. And this is the format, as everybody knows: the '@' line, "my first sequence" here, would be that concatenated read name; then your sequence string; then a '+' separator line, optionally repeating the name; then your qualities; and then the next record, and the next, and the next. Okay? Okay, I'm almost done; I'm supposed to end at 10, is that right? Okay, I'll be real quick here. So: now you've completed your ChIP sequencing experiment. You've gone through the molecular biology and the sequencing, and you've got your FASTQ file. How do you assess its quality? This is a fundamental question: you always want to make sure your data is good before you start doing things with it. So how do we assess it? Three main ways. Sequencing quality: does the sequencing look good? We can measure that by looking at the base qualities, for example, and at other characteristics, as we'll discuss. Library quality: what's the quality of your library? Are we seeing an even representation of the genome? Are we seeing lots of PCR duplicates? And IP quality: did our immunoprecipitation work? Are we capturing the regions of the genome we expect, et cetera? So first, for sequencing quality, there are a number of tools out there that do this. FastQC is one that's been used by the community for almost a decade now. It's fairly easy to run; it can be run on the command line, so you can do many at the same time, or through a user interface. It provides a set of analyses you can use to examine the quality of your data set, so for those of you starting out, it's a very easy tool to begin getting familiar with your data. That's the path to it, it's been installed on the server, and we're going to go through it. Okay, so that's how we look at the sequencing quality, but what about the library quality?
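Before moving on to library quality, here's a sketch tying the FASTQ pieces together: iterating records, decoding Phred+33 qualities, and reading the pass-filter flag out of a Casava 1.8+ style read name. The exact name layout varies with pipeline version, so treat the parsing as illustrative rather than as the one true format.

```python
import gzip

def fastq_records(path: str):
    """Yield (name, sequence, quality) tuples from a FASTQ(.gz) file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                return
            seq = fh.readline().rstrip()
            fh.readline()                      # the '+' separator line
            qual = fh.readline().rstrip()
            yield header[1:], seq, qual        # strip the leading '@'

def phred33_error_probs(qual: str):
    """ASCII character -> Phred score -> error probability, P = 10^(-Q/10)."""
    return [10 ** (-(ord(c) - 33) / 10) for c in qual]

def failed_chastity(name: str) -> bool:
    """Casava 1.8+ names look like:
    instrument:run:flowcell:lane:tile:x:y  read:is_filtered:control:index
    where is_filtered == 'Y' marks a chastity-filter failure."""
    _, _, annotation = name.partition(" ")
    fields = annotation.split(":")
    return len(fields) >= 2 and fields[1] == "Y"

# The "grep for chastity failures" check from the talk, in Python:
# sum(failed_chastity(n) for n, _, _ in fastq_records("run.fastq.gz"))
```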
So some of the characteristics that you can look at are, for example, the insert size distribution, okay? I bring this up because you will see a lot of variation in insert sizes, and the insert size distribution actually also depends on your immunoprecipitation conditions. That's certainly true in native ChIP sequencing, perhaps not so much in cross-linked ChIP, but certainly true in native ChIP sequencing. So what I'm showing you here is the DNA fragment length, and this is just a distribution that was generated using the read pairs: we've aligned them to the genome and plotted the distribution, which you can compute yourself from the aligned read pairs, as sketched below. So this is the number of fragments on this axis, and this is the DNA fragment length, right? We're just counting how many times we see, for example, 100 base pairs, 200 base pairs, 300 base pairs, and plotting it.

The black line indicates the input. This is the chromatin that we digested but did not immunoprecipitate; we just shotgun sequenced it. And as Misha will discuss, having the input is critical for determining your background. So these black lines are our input, and then these red lines, and there are actually two different shades which are difficult to see here, are our IPs. The point I wanted to make here is that just by doing your IPs, you're actually shifting your insert size distribution. Anybody have an idea why that might be? Why should we see this in the data? This is our input, let's remember: the distribution of fragments that we put into the IP. And this is the distribution of fragments that came out of the IP. Black is what we put in, red is what we got out. Why do we see this shift in size? Say that again? There's an enrichment. Right, right, so that's interesting. For example, although you can't see the difference in shades, you can see that H3K27 trimethylation, the Polycomb mark, is actually enriching for bigger, larger fragments, even though this is what it was presented with, right? But this is what came out. Remember that we're giving it a distribution of fragments, and of course the larger fragments are present in the input library, but at a very low frequency. In our IP, we've highly enriched for those in what we ended up sequencing on the sequencer, right? This is native ChIP. In cross-linked ChIP, you don't see this shift to the same degree, but it's still present. So we can look at these distributions as a way of asking, well, does this make sense? Am I seeing what I expect? With Trithorax, which is H3K4 trimethylation, you see enrichment of smaller fragments, which might be indicative of, for example, more open chromatin being enriched, okay? So that's one way to do it.

We can also look at the library quality by looking at the aligned fraction: how many of our reads did we align? This is actually a screenshot from our production facility. Each one of these bars represents one ChIP-seq library, a random set of ChIP-seq libraries. What we do is compare the aligned fraction of this experiment against all the other experiments that we've run, and we ask, well, does this fit within the distribution? You know, this is around 90%.
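Here is the sketch mentioned above: one way the fragment-length distribution could be computed from aligned read pairs, assuming a BAM file and the pysam library. The absolute TLEN (template length) of read 1 in a properly paired alignment gives the insert size; the file names are hypothetical.

```python
# Sketch: tally insert sizes from aligned, properly paired reads.
# Assumes pysam is installed; the BAM file names below are hypothetical.

import collections
import pysam

def insert_size_histogram(bam_path: str, max_len: int = 1000) -> collections.Counter:
    hist = collections.Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            # count each fragment once, via read 1 of a proper pair
            if read.is_proper_pair and read.is_read1 and not read.is_duplicate:
                tlen = abs(read.template_length)
                if 0 < tlen <= max_len:
                    hist[tlen] += 1
    return hist

# Compare the input (black line) against an IP (red lines), e.g.:
# hist_input = insert_size_histogram("input.bam")
# hist_ip    = insert_size_histogram("H3K27me3_IP.bam")
```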
So using current state-of-the-art sequencing technologies and a good IP, you should expect to see greater than 90% alignment in a ChIP-seq experiment, at least when you're dealing with histone modifications. The very lowest I would say you should move forward with is 50% mappability, but there you probably have some molecular biology concerns to think about. So that's one characteristic: we want to see a majority of aligned reads. And I think this is just looking at it in a different context. Now we've used a mapping quality, and I'm not going to go into that because I think Misha's going to talk about it, but this is essentially looking at aligned reads after we strip out duplicate reads.

Of course, we can also look at PCR duplicates directly. How many PCR duplicates do we have in our library? And of course, PCR duplicates also depend on how deeply you sequence the library. If you completely sample the diversity of your library, you're going to expect to see PCR duplicates: the more you sequence, the more duplicates you're going to see. But at the levels that we're sequencing here, which, as I said, is 50 million read pairs for K4 and punctate marks like K27 acetylation, and 100 million reads for K36, these are the duplicate rates. You can see that for good libraries, you should be in the regime of about 1%, one or 2%; it shouldn't be much more than that. If you're up in this regime here, at five, six, seven, eight, nine, 10%, you have likely oversampled the diversity of your library, and you might want to ask yourself, well, am I at a sequencing depth where I expect to be oversampling the diversity of my library? If I am, maybe there's a problem; maybe I've done too many PCR cycles.

Okay, finally, another very quick check is to look at your so-called domain reads. You can just take your IP, and if you're using marks for which you have expectations about their locations in the genome, you can simply measure the fraction of reads that fall within those domains. For example, with H3K27 acetylation, the top panel, we can take enhancer states. All we do is take all the enhancer states that have been published by the NIH Roadmap, so we have a set of coordinates, and these are easily obtainable. And then you can just ask, well, how many reads are falling in these enhancer states? And this is the kind of thing you see: here, 10 to 15% of my reads are falling within enhancer states, and you can see that it's fairly stable, although there are outliers. Similarly with H3K36, which is a mark within gene bodies: if we just take protein-coding genes and ask how many reads fall within them, you can see that we're in the 60 to 70% regime. It's a very simple way of looking at your data and convincing yourself that this IP is of good quality and worth pursuing, right? Don't just jump in, generate WIG files, and start looking at the data. You must look at the quality of the data before you proceed. Okay, so with that I will stop. We will talk about additional metrics in module two with Misha.
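For reference, the domain-reads check described above could be implemented along these lines: load a published set of intervals (say, Roadmap enhancer states as a BED file) and count the fraction of aligned, non-duplicate reads whose start position falls inside them. This is a rough sketch assuming non-overlapping intervals, hypothetical file names, and the pysam library; it is not the production pipeline shown on the slides.

```python
# Rough sketch of the "domain reads" metric: the fraction of aligned reads
# whose start position falls inside a set of reference intervals.
# Assumes non-overlapping BED intervals and that pysam is installed.

import bisect
import collections
import pysam

def load_bed(bed_path: str):
    """Load intervals per chromosome as parallel start/end lists, sorted by start."""
    starts = collections.defaultdict(list)
    ends = collections.defaultdict(list)
    with open(bed_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 3 or not fields[1].isdigit():
                continue                       # skip headers and track lines
            starts[fields[0]].append(int(fields[1]))
            ends[fields[0]].append(int(fields[2]))
    for chrom in starts:
        order = sorted(range(len(starts[chrom])), key=starts[chrom].__getitem__)
        starts[chrom] = [starts[chrom][i] for i in order]
        ends[chrom] = [ends[chrom][i] for i in order]
    return starts, ends

def fraction_in_domains(bam_path: str, bed_path: str) -> float:
    starts, ends = load_bed(bed_path)
    inside = total = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_duplicate:
                continue
            total += 1
            pos = read.reference_start
            chrom_starts = starts.get(read.reference_name, [])
            # last interval whose start is <= pos; check the read lies before its end
            i = bisect.bisect_right(chrom_starts, pos) - 1
            if i >= 0 and pos < ends[read.reference_name][i]:
                inside += 1
    return inside / total if total else 0.0

# e.g. fraction_in_domains("H3K27ac.bam", "roadmap_enhancer_states.bed")
# the talk suggests roughly 10-15% for enhancer states with H3K27ac
```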