All right, so thanks, everyone, for coming out this morning. I'm going to give a very brief introduction to epigenomics — obviously there's a lot to unpack there — and then I'm going to spend time talking about ChIP sequencing. I'll also spend a little time on the underlying technology, massively parallel sequencing, which has really driven the field of epigenomics, as I think everyone in this room knows. In fact, that's how I got into this field back in 2007: some of the first experiments we ever ran on the Solexa platform at the time were ChIP-seq experiments. So I'll be talking about how the FASTQ files are generated, which is of course the commodity of ChIP-seq as well as whole-genome bisulfite sequencing. You've seen this already; this is my introductory slide. As I mentioned, I've really been in this field for about a decade now. The view I'm showing you here is from the Roadmap Epigenomics Consortium, and I'll be talking about some of the data we've generated through that. In fact, the lab practical is based on data generated as part of the Roadmap. The learning objectives for the module this morning may seem a bit redundant given everyone's experience here, but we're going to talk a little about some epigenomic principles, and then I'm going to go into the basic molecular biology driving Illumina next-generation sequencing. So we'll talk about how we generate the clusters and how we generate the paired-end reads, which will then flow into how these are converted to FASTQ files — I'm a big proponent of understanding the data you're dealing with before moving into the bioinformatics. Then I'll talk about some of the underlying principles and challenges of ChIP sequencing, and then do a very high-level overview of the ChIP-seq data analysis workflow. And in our practicum, after the coffee break, we'll work on running a tool called FastQC to look at the quality of the FASTQ outputs. So given the background of this audience, I like to use this analogy, which I'm sure some of you have heard: if we think about the genome as a hard drive, the epigenome is really the software that runs on top of it. And the way we liked to describe this as part of the Roadmap project was this idea of a roadmap — a way for the cell to interpret the genome through a layout of roads and networks specific to that cell. The genome of a liver cell and the genome of a pancreatic cell is of course the same, but the roadmap through which it is interpreted is different, and that roadmap is encoded within the epigenome. I always like to start my epigenome talks by going back to the history, and as I'm sure everyone is aware, the term epigenotype was actually coined by Conrad Waddington back in 1942, before we even knew that DNA was the hereditary material. He was using it as a way of describing how genotype becomes phenotype, in the context of fly wing development. I actually found this journal in the archives at UBC, and that's the actual figure there. So if we fast-forward and think about what we were really talking about here, it's genotype giving rise to phenotype, and of course there are many components to that — transcription factors, external cellular signaling, MAP kinase signaling, et cetera, et cetera.
So there are many components to it, and of course the epigenome is just one of them. A few years after coining the term, Waddington came up with this diagram that I think probably everybody in this room is familiar with: the concept of starting from an open, plastic state — the totipotent zygote — and running down these valleys, so that as we differentiate into terminally differentiated cell types, such as blood T cells, breast myoepithelial cells, or brain cells — neurons or glia — the epigenome becomes more and more restricted. But of course, as with all things in biology, nothing is black and white, and we now know that we can in fact get back over these little humps: we can reprogram cells and we can reprogram the epigenome, although that reprogramming, at least as best we understand it, is not complete. I heard amongst the folks here today that some people are studying environmental interactions. That's another very interesting aspect of the epigenome: the fact that it's potentially reversible and the fact that it interacts with the environment. In this view, the epigenome provides a way for the cell to respond on a much shorter timescale than genetic changes occurring through evolution. And if we think about it, an environmental impact that affects the epigenome — say the epigenome of a breast cell — may have negative, or in some cases perhaps positive, consequences. If you look epidemiologically, there's some curious data out there, and I heard somebody here is studying heritability, so how epigenetic marks are passed on. There's a famous study looking at data from a northern Swedish town called Överkalix, where they looked at dietary intake — the harvest — and then looked at the mortality rates of the grandparents, the parents, and the grandchildren, and they found a curious relationship between nutrient uptake and an increase in mortality of the grandchildren. So there are observations out there suggesting that these epigenetic marks are in fact heritable, okay? And if we think about it in that context, an environmental impact on a parent could potentially be passed on to their offspring, and could then be passed on to further generations. But I would just like to point out that, as of today, there is no concrete evidence of that kind of heritability. When we talk about heritability in the context of epigenetics, what we're talking about is mitotic inheritance; we're not talking about transgenerational inheritance. I think that's something important to understand as you begin your careers — or for those of you already well into your careers — our concept of this inheritance is really at the mitotic level, not transgenerational, okay? So the classic definition of epigenetics is the study of heritable — and again, heritable here means mitotic inheritance — changes in gene expression that occur without a change in DNA sequence. And there have been a number of technologies developed over the last 10 years to study the epigenome. Initially this was done using arrays, but, as I think everyone in this room knows, it has now largely moved to massively parallel sequencing technologies, and we'll be talking about two of those technologies in this workshop.
We'll be talking about ChIP-seq, which I think everyone is familiar with, which is the study of histone modifications in the context of nucleosomes, and we study those using ChIP sequencing. There are technologies such as DNase I sequencing (DNase-seq), really pioneered by John Stamatoyannopoulos at the University of Washington, for looking at open regions of chromatin — this concept of open and closed chromatin — and these DNase I hypersensitive sites have been leveraged by John and many others in the research community as regions of open chromatin that are engaged by transcription factors. RNA, of course, has been shown in some cases to play a role in mediating epigenetic regulation, and we study this using RNA-seq. But of course RNA-seq isn't one thing, it's many things: it depends on the molecular biology driving the generation of the libraries that go onto the sequencer — whether you're using polyadenylated selection or ribo-depletion, et cetera. And then of course the DNA itself is modified: in mammalian genomes you have about 58 million CpGs, and these can be methylated. I'm not going to go into that today; Guillaume and David will be talking about it tomorrow. And of course these epigenetic modifications don't operate individually. They operate, as David Allis coined it, as a histone code, as far as we know. So, for example, we can have open regions of chromatin that are DNA methylated, and these have been shown to be resistant to recruitment of certain transcription factors, whereas DNase I hypersensitive sites that are hypomethylated are the sites thought to be actively bound by transcription factors. Okay, so in today's talk I'm going to focus on ChIP sequencing, and here's a diagram of a nucleosome. As I'm sure everyone in this room is aware, the nucleosome contains the core histone molecules and then unstructured N-terminal tails, and these unstructured N-terminal tails are decorated with a variety of post-translational modifications, including some of those shown here: acetylation, methylation, phosphorylation, ubiquitination, sumoylation, et cetera. And there's a whole series of enzymes — the so-called readers, writers, and erasers — that many of us study, and these enzymes work to add, remove, and read the histone code. The Structural Genomics Consortium, as well as many others, is focused on understanding these enzymes, generating structures for them, and generating inhibitors against them. Okay, so there are hundreds of epigenetic modifications, which is very daunting. How do we begin to study them? Well, back in 2008 the NIH Common Fund funded the Roadmap Epigenomics Consortium, and its task was to generate so-called reference epigenomes. At the first meeting of the Roadmap, in Bethesda, we came up with a description of what a reference epigenome was. How we arrived at that description was essentially through pioneering work from the ENCODE consortium, which some of you might be aware of, where we tried to select a set of histone modifications that describe the variability of the epigenome as broadly as possible. Many epigenetic modifications actually appear to be redundant: when you have one, you can call the other by its presence.
So for example, at active promoters we may have H3K4 trimethylation, and that co-occurs, for example, with H3K27 acetylation, so perhaps both of those would not need to be profiled to mark promoters. I do just want to briefly mention that these are the six marks that were selected by the consortium to generate the so-called core reference set, and these are the modifications that have been adopted by the International Human Epigenome Consortium, of which the CERC network — a co-funder of this, or at least a funder of some of the travel awards — is part. Now, what is the nomenclature for histone modifications? This is the accepted nomenclature. The name starts with the histone itself, in this case H3. The next characters indicate which amino acid is modified — in this case lysine 4 of histone H3 — and then the modification, here trimethylation. Methylation can of course occur as mono-, di-, or trimethylation, and I'm showing you the trimethylated form in this case. So H3K4 trimethylation, H3K9 trimethylation — it should be pretty straightforward. In terms of the core reference marks, there are two sets. There are those that act to reinforce open or active chromatin: H3K4 trimethylation at active promoters, H3K4 monomethylation and H3K27 acetylation at enhancers, and H3K36 trimethylation associated with elongating transcription. Of course, we could spend a lot of time talking about each one of these, but I'm just giving you a very high-level overview. There are also two marks associated with repressive chromatin: H3K9 trimethylation and H3K27 trimethylation. We'll be talking more about the characteristics of these marks, and why they've been labeled in these ways, as we go through the lecture today. Okay, so I talked about what epigenetics is, and we talked about some of the epidemiology. Of course, there are other epigenetic effects that have been annotated, which has made this, I think, a very interesting field to study. Some of you might be aware of the Dutch Hunger Winter, or the famine during the Great Leap Forward in China, where a population was put under severe caloric restriction, and the babies born to women who were pregnant during that period have shown a higher incidence of cardiovascular disease, psychiatric disorders, et cetera. What is potentially even more interesting, in some ways, is that the grandchildren — the children of those children — are also showing these characteristics. And so epigenetics has been invoked as a possible mechanism for how this is being inherited through the generations, although, as I mentioned, as of today we don't have any direct evidence for this, and I'd be happy to talk about the reasons for that in the break. We also know that for common traits and diseases the link to genotype is not as strong. For some characteristics, such as height, if we look at the difference between monozygotic and dizygotic twins, genetic mechanisms can explain a significant portion of the variation in the population, but for many common diseases they can't. In many cases one twin will develop, for example, breast cancer, but the other will not. So this suggests that genetic mechanisms are not the sole drivers of the incidence of these diseases.
And more recently, memory: it seems as though at least some functions of memory are being patterned through epigenetic mechanisms. In this particular case I'm showing you a memory model in fly, using a knockout of a particular epigenetic modifier, G9a. In the field that I work in, which is cancer, as I think some of you are probably aware — again thanks in large part to the technological advances in sequencing — we've spent almost the last decade deeply sequencing a wide variety of cancer genomes. One of the surprising findings that emerged from that work is that genetic lesions in epigenetic modifiers appear to be very common events in many types of cancer, in some cancers more than others. One of the cancer types I study is the myeloid malignancies, and in myeloid malignancies there seems to be a high prevalence of mutations in epigenetic modifiers — for example IDH1 and IDH2, which generate an oncometabolite, and other enzymes such as DNMT3A, which is involved in laying down DNA methylation, and TET2, which is involved in hydroxymethylation. So both in epidemiological studies and through cancer genome sequencing, I think the link between disease and the epigenome is being established. Okay, so that was my ten-minute, very high-level overview of the epigenome, just as a way of introduction. This is the summary. Are there any questions, or does anyone have any comments about the biology of the epigenome they'd like to raise now? If not, I'll move on to the next section. Not hearing anything — okay. So that's the epigenome, but what we're really here to learn about is the bioinformatics behind how we actually study it, and for the lecture this morning I'll be talking about chromatin immunoprecipitation sequencing. So what is ChIP-seq? ChIP-seq is essentially a way of profiling histone modifications genome-wide, and this is a diagram I came up with a number of years ago that shows the flow of how this works. We start with a cell, or a genome, and then we fragment the genome. We can digest it with MNase — so-called native ChIP sequencing, which is becoming more prevalent in the literature — or we can use formaldehyde cross-linking and then shear the DNA using sonication. We typically shear into the range of 100 to 300 base pairs, which, of course, is within the range of a single nucleosome wrap: a single wrap of DNA around a nucleosome is about 150 base pairs. But the actual range of 100 to 300 came out of initial experiments on the Solexa platform, where it turned out to be about the appropriate length for generating clusters, which we'll get into in a moment. So once we have this sheared set of fragments, either by native digestion or by sonication, we immunoprecipitate with an antibody that's specific for the histone modification of interest, which is shown there. This essentially enriches for the DNA fragments associated with the nucleosomes that carry the modification we're interested in, and so we end up with an enriched set of fragments that we can then study. The original experiments, done by Bing Ren and others, used ChIP-chip, so they basically used an array.
So you strip off the proteins, you're left with the DNA, and you can label it and probe an array, either a promoter array or a tiling array. Of course, that technology has now been largely replaced by our ability to actually sequence those fragments. So we take those fragments, strip off the proteins, and generate a so-called library. A library is essentially a collection of fragments to whose ends we've added DNA sequences that allow us to generate clusters and then sequence them. Once we sequence them, we align the reads to a reference genome and then build profiles of where those modifications are. Okay, so what are some key considerations in performing ChIP sequencing? First of all, antibody specificity and sensitivity — I can't stress this enough. If you're working with data that others have generated, I would take the time to figure out what QC has been done on the antibody, okay? The other consideration is: what marks should I profile? Again, it depends on what you're trying to do. If you're interested in Polycomb group regulation, you probably want to study H3K27 trimethylation. If you're interested in enhancer states, then you'll probably want to study H3K4 monomethylation and H3K27 acetylation. If you're interested in, for example, endogenous retroviral silencing, then perhaps H3K9 trimethylation is the appropriate mark. So it really depends on the question you're trying to ask, and you need to design your experiment and pick the marks that will best answer that question. Sequencing depth — this is a question I get asked all the time: how deeply should I sequence each mark? What I've given you here are the current recommendations from the International Human Epigenome Consortium. When we first started the Roadmap project, sequencing capacities were much lower than they are today, and back then our target was 12 million reads per IP, but we've subsequently learned that that's insufficient to sample the diversity of the libraries being pulled down, okay? The current recommendation for punctate marks is essentially 50 million reads, which is equivalent to 25 million read pairs, or clusters — so 25 million observations of fragments, since with paired-end reads we're reading both ends of each fragment; I'll go into that in a bit more detail shortly. Examples of punctate marks would be H3K4 trimethylation or H3K27 acetylation; these marks tend to have very punctate occupancy of the genome. For broader marks, such as H3K9 trimethylation — and in many cases H3K4 monomethylation is also considered a broad mark, as it's very widely dispersed across the genome — a higher read depth is recommended, and there we're getting to 100 million reads, or 50 million fragments, okay? So those are the current recommendations, and again, I'd be happy to discuss the details during the break or later in the lecture. I guess the other point to keep in mind is whether you are working with data sets generated from limiting numbers of cells. I talked about native ChIP sequencing. With the original formaldehyde cross-linked ChIP-seq we would go in with about a million cells, which is a lot of cells. For a cell line that's not a big deal, but when we're talking about primary human tissue, that becomes very limiting.
Native ChIP sequencing, where we're not using formaldehyde cross-linking, is a way of reducing those numbers significantly, so you're able to get down to 100,000 or even 10,000 cells. And when you're down to limiting numbers of cells, you can actually sample the diversity of that library with fewer reads than are required to sample the diversity of, say, a cross-linked library that comes from a million cells. So as you design your experiment, you should be thinking: what I'm really trying to do here is sequence the diversity of the immunoprecipitated library to sufficient depth to represent it in the subsequent analysis. Okay, so I'm just going to spend a little time talking about ChIP-seq antibody validation. I'm sure everyone in this room has heard this before: garbage in, garbage out. If you don't know what the specificity of the antibody is, the data you're going to get out of it — and that you're going to spend a lot of time analyzing — is not going to be of high quality. So I really recommend you spend the time to understand the biology, and if you're doing the immunoprecipitation yourself, spend the time to find the right antibody and QC that antibody. As of today, as far as I'm aware, there are no ChIP-seq antibodies that you can rely on time in and time out, with the exception of some monoclonal antibodies — there is one, for example, from Cell Signaling Technology, which is a monoclonal. Most commercially available antibodies are polyclonal, which means there's only a limited amount: you may order a lot, and in six months you may order again and it will be a different lot. So within our own centre, initially as part of the Roadmap and now through the CERC initiative, we run a series of QC steps on all antibodies that we bring into the group or into production. This includes a peptide array — essentially an array of 384 modified peptides — where we look for specificity, and that's what's shown here. I can't actually read this because it's so small, but this would be the targeted mark here, H3K4 trimethylation; you can see there are some off-target effects, but the majority of the signal comes from the intended target. We also do a Western blot to look for specificity in the IP: here we're essentially blotting a whole-cell lysate and looking for the specific band. And if we're ordering a new catalogue number from a vendor, we will also take it all the way through ChIP sequencing and compare it against the bioinformatic QC metrics that I'll talk about a little later in the lecture, to ensure that its performance is what we expect. We've spent a lot of time hunting down antibodies that pass these metrics, and I can tell you that even today you could go out and order a "gold standard" ChIP-seq antibody, as annotated by the vendor, and it will fail these QC analyses, okay? So my recommendation: if you're working with ChIP-seq, spend the time to QC your antibody, and once you've found an antibody that passes QC, get as much of it as you can within the limits of how much you're going to use. Here's another example, an acetylation antibody — this is H3K27 acetylation.
You can see — I just wanted to show you this — that the off-target effects of K27 acetylation antibodies are higher than we typically see with methylation antibodies. That's just a fact of this particular epitope, and those off-target effects are something to keep in mind. Okay, are there any questions about that? This data, if you're interested, is available here, and I'd be happy to discuss the QC of these antibodies later. Okay, so now let's talk about ChIP-seq processing. This is one way I like to think about it, and I think the manuscript describing this process flow was included in one of the handouts or the recommended reading. Essentially it shows this concept of levels of analysis: level zero, level one, level two, level three. I'm going to spend my lecture primarily on level zero — basically going from the sequenced library to the FASTQ file, and what that is. Misha will be talking about level one, which is essentially taking that FASTQ file and aligning it, and then level two, the segmentation. And in tomorrow's lecture we'll talk a little bit about level three, which is integration, visualization, and analysis. So to start with, for ChIP-seq processing I'll be talking about the generation of the FASTQ files. Of course, ChIP-seq stands for ChIP sequencing, and understanding the basic principles of how the sequencing works is an important first step for any bioinformatic analysis that depends on the resulting files. Sanger sequencing was the first high-throughput sequencing technology ever developed — this of course is Dr. Sanger here, who actually won two Nobel Prizes, which is always pretty amazing: one for sequencing protein and one for sequencing DNA. These are the gels that, not so long ago, represented the state of the art in DNA sequencing. This involves the incorporation of terminating dideoxynucleotides, which are either radioactively or fluorescently labeled and then visualized. This was industrialized as part of the Human Genome Project, where we went from slab gels to capillaries — first a single capillary in the instrument of the time, the ABI 310, and then the 3730 — and then farms of these, literally hundreds of instruments, were arrayed out at some of the big genome centres, for example WashU, the Broad, and the Sanger Centre, and these were the instruments that generated the Human Genome Project. So why do I bring these up? I bring these up because when sequencing was first developed, we had no way of assessing the quality of a sequence. We got an A or a T or a C or a G off the back end of these platforms — or even if you look at this gel, I can say that in this lane it looks like a C, but how do I encode the quality of these base calls? And this, of course, becomes important when we start talking about tens of thousands, or millions, or hundreds of millions of sequencing reads. So Phil Green, from the University of Washington, developed the so-called Phred quality standard, which is shown here. It's essentially a log-transformed error probability — the probability that the base has been called incorrectly — and the formula is Q = -10 log10(P). How are these probabilities determined? Well, they're determined essentially empirically, okay?
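Just to make that formula concrete, here is a minimal sketch of the Phred relationship; the function names are mine, purely for illustration:

```python
import math

def phred_from_error_prob(p: float) -> float:
    """Phred quality from the probability that a base call is wrong: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)

def error_prob_from_phred(q: float) -> float:
    """Invert the relationship to recover the error probability."""
    return 10.0 ** (-q / 10.0)

print(phred_from_error_prob(1e-5))  # 50.0  -> 99.999% chance the call is correct
print(error_prob_from_phred(10))    # 0.1   -> 90% chance the call is correct
```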
So these probabilities are determined by looking at the performance of, in this case, sequencing traces, which I'll talk about on the next slide, and then deriving a set of — in this case four — characteristics that go into calculating the probability. This is what Phred looks like on an analog sequencing run. This is an example of a trace that I hope everyone in this room is very familiar with — I can't actually read this, but you can see that, for example, this base here, this T, is very well separated from the adjacent T's, you get nice separation, and you don't have a high background coming up. So based on these four metrics, this base is given a very high Phred score, a Phred of 50, which is 99.999% accurate. A base here, for example, where you get much less separation between the peaks and a higher background, is given a lower Phred score — I can't actually read what it is, I think it's Phred 10, which is essentially a 90% probability that it's correct. And then there are bases here where you can't even tell what they are; in this case it's given an N and a very low Phred score. Now, in the initial generation of Solexa data, which then became Illumina — and this was also true of the SOLiD system and the Ion Torrent system — there was no way of calculating qualities in this manner, because we had no empirical standards to work against. So the initial quality scores generated on these platforms were essentially based on taking the reads, realigning them to the reference genome, and seeing what the mismatch rate was — which of course is fraught with difficulties, because we have SNPs and other variation relative to the reference genome, which skews the quality scores. But over time a similar set of characteristics — and I'm not going to go through them — was developed, based on things like how well separated the clusters are and the intensity of the fluorescence for the clusters, which we'll talk about, and that allowed a move to log-transformed quality scores. So now, for Illumina sequencing, every base has a log-transformed quality which is based on the Phred system but uses a different set of empirical measures. Okay, the other thing I want you to keep in mind, especially if you're looking at historical data, is that there has been very rapid development of the next-generation sequencing platforms. This was the initial instrument I was introduced to in the fall of 2006, the Solexa instrument, and in only six years we went from this instrument to this one. That involved a lot of changes, including the incorporation of these quality standards, the incorporation of paired-end reads, and the incorporation of the index read. So as you go back historically through the data, you need to be aware of what time period you're dealing with and which Illumina (or SOLiD) processing pipeline was used to generate that data, because it has consequences. I'll mention one as we get into it: ASCII quality scores. There was a time when the ASCII offset was 64 for Illumina, whereas the FASTQ standard is 33. So if you go back far enough and find some of that original data, you've got to be aware of which ASCII encoding you're working with. It's just something to keep in mind. Okay, so how does library generation work?
This is how it works, for those of you who don't work in the lab. We've talked about the IP — or you can start from DNA or cDNA or whatever you like. You shear it into a set of random fragments and then ligate adapters to them — there are other technologies for introducing the adapters, but I'll just give the basic overview. So we ligate adapters, and then we usually do a PCR step to add oligos to the ends of these fragments that are complementary to the oligos grafted onto the flow cell surface, okay? These are what enable us to do sequencing on the Illumina system. More recently, within the last five years or so, we've also included index sequences. So we include a hexamer index, incorporated into the adapter, that allows us to pool multiple libraries together on a single lane of an Illumina flow cell; they are subsequently separated by index, so when you see your analysis it's all from the same lane, but the reads are labelled by their index, which is incorporated into the barcode. More recent technologies incorporate two barcodes, so you can have barcodes on both ends of the fragment, which allows for even higher orders of multiplexing, or error correction. The current generation of HiSeq sequencer is here — this is the HiSeq 2500, or the 4000 — and that's its output. Some of you may also have heard of the HiSeq X platform; there are a few of these, three here in Canada, and of course others in the U.S. and many other places. But keep in mind, when you're doing your experimental design, that you can't use the X platform for ChIP sequencing, because Illumina doesn't allow it — so you'll have to base it on the HiSeq. This is what a flow cell looks like, and now I'm going to go through how we actually do the sequencing itself. Okay, so we start with the DNA fragments and ligate the adapters — in here, these are the purple and blue things — so we end up with this collection of fragments that have these purple and blue ends on them. We then flow them over a flow cell surface onto which oligos complementary to either the blue or the purple have been grafted to the glass. The original Illumina sequencers were able to image one side of this piece of glass; more recent generations, like the 2500s and 4000s, actually image both sides, so you have surface one and surface two, but the principle is the same: we capture these fragments on the flow cell surface and generate clusters from them. Now, what about other sequencing technologies, such as SOLiD or Ion Torrent? They use a very similar process, but instead of grafting onto a solid glass surface, the oligos are grafted onto beads, and the amplification happens in a so-called emulsion. So we generate an emulsion into which we add our fragments; we have a bead that's been decorated on the outside with oligos complementary to the oligos you've added to your adapter, and then we go through the amplification. Now, what are the advantages and disadvantages of the two systems? Well, one thing to keep in mind is that for an emulsion we need to think about the Poisson distribution: because we have to generate an emulsion, we want to drive it so that, on average, we have one bead and one fragment of DNA per droplet.
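As a rough, back-of-the-envelope illustration of that Poisson loading — my own numbers, assuming templates land in droplets independently with an average of one template per droplet:

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of k templates in a droplet when the mean occupancy is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 1.0  # aim for one template per droplet on average
empty = poisson_pmf(0, lam)       # ~0.37 of droplets capture nothing
single = poisson_pmf(1, lam)      # ~0.37 are the useful, monoclonal droplets
multiple = 1.0 - empty - single   # ~0.26 are polyclonal and effectively wasted
print(f"empty={empty:.2f}, single={single:.2f}, multiple={multiple:.2f}")
```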
So in practice, many of our so-called micro-reactors, or micelles, have only a bead and no piece of DNA, many have DNA, and many are empty. As a result, the volumes required to actually do emulsion PCR are significantly higher, and the amount of material you need is significantly higher, than on a solid surface. For today's lecture I'm going to talk exclusively about Illumina sequencing, but the principles are essentially the same. Okay, so we flow the library over, the fragments are grafted, and from there, through a process of random diffusion, the piece of DNA flops over and hybridizes to a complementary oligo in its vicinity. On the original Illumina sequencers this was a random distribution — the fragments were randomly grafted onto the flow cell surface, so they're all over the place. On the HiSeq 4000 and the X platform the clusters are now printed in an ordered array, which has some implications that I'll very briefly touch on, but the vast majority of the data you'll see — and probably the data you're going to be generating — will have been done on the 2500, which is an unordered, random array, okay? So essentially, you get this diffusion, then you get a priming event that occurs here, because you've got your purple oligo here, five prime to three prime, and we do a single isothermal extension, which generates a copy of that fragment. We then repeat that process — 30 times, 28 times, however many cycles are used for generating the clusters — and we end up with a group of fragments that are clonal copies of the original captured piece, okay? We then use either a chemical or an enzymatic process to cleave one end, so we're left with single-stranded molecules oriented in a way that allows us to prime a sequencing reaction. And again, to make the point: on the 2500 and earlier series of Illumina instruments these fragments were randomly distributed on the flow cell surface, and this will become important when we talk about sequence naming. Okay, how does the sequencing itself work? Now we're just zooming in on one end of the fragment: we add our sequencing primer and our sequencing reagents, and then we incorporate one base per cycle. The technology that founded Illumina is this idea of reversible terminators, which separates it from Sanger sequencing, where once a dideoxy base is incorporated the strand is permanently terminated. The Illumina technology allows for reversible termination: in other words, we can incorporate one base, stop the extension, then remove that termination and allow the sequencing to continue. This is in contrast to, for example, Ion Torrent or SOLiD sequencing, which don't use reversible terminators but rather flow each of the nucleotides over one at a time, which can be problematic for homopolymeric repeats, et cetera. But essentially we get one incorporation event, we wash away the unincorporated bases, we shine a laser and detect the signal, and then we repeat that process over and over again to build the sequence base by base.
So, very conceptually, this is how the sequencing itself actually works. We start with the first incorporation event, cycle one, and in cycle one we generate what's known as the focal map. The focal map gives us the positions of each of these clusters on the flow cell surface. If we think about a flow cell, we can think of it as a lane; we divide the lane into tiles, and within each tile every cluster has an X and a Y coordinate — so perhaps this one is X5, Y1, et cetera. This becomes important because these X,Y coordinates are essentially how the sequences are named, and how the sequences get unique identifiers amongst the millions, or in some cases billions, of reads that you might generate in the course of an experiment. So the first step is generating this focal map, and for those of you who have done some Illumina sequencing, you'll know that if you use a library that has reduced diversity in the first few bases, you may have issues with generating the focal map. The reason is that if all of my bases in the first incorporation are, for example, whatever's green here — T — then I'm going to get a very strong T signal and I won't be able to separate the individual clusters, and so this becomes a problem. There are solutions around that, for example spiking in a diverse library such as PhiX. Newer generations of sequencing platforms, like the HiSeq 4000, use an ordered array, so this issue is no longer relevant there. Okay, so we start with the focal map, and then we build the sequence one base at a time: at every incorporation we scan, we take an image, and that image stack builds up. To convert the images to sequences, you can think of those images as being stacked up, and we basically flip through the cards looking at the same focal position. So let's say this is tile one; looking at tile one, position Y1, X5, it's, let's say, a T, then a G, then a C, et cetera — that's essentially how the sequences are built up from the images. Okay, so the initial sequencing generated by Solexa was primarily single-end reads. So we have our fragment, we have our adapters on the ends of it — I don't know if everyone can see this, maybe not, sorry — and we would sequence in from one end. In fact, the data we're going to be working with today is single-end data, and a significant amount of ChIP-seq data is still generated single-end. Misha will talk about the advantages of doing paired-end reads when it comes to ChIP-seq, but initially it was single-end, and this is showing essentially the chemistry of how that works.
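Going back to the "flip through the image stack" idea for a second, here is a toy sketch of that lookup — not Illumina's actual pipeline, and the data layout and read-naming scheme are invented purely for illustration:

```python
# Each cycle's base calls, keyed by cluster position (tile, x, y) from the focal map.
cycles = [
    {(1101, 5, 1): "T", (1101, 8, 3): "A"},  # cycle 1
    {(1101, 5, 1): "G", (1101, 8, 3): "A"},  # cycle 2
    {(1101, 5, 1): "C", (1101, 8, 3): "T"},  # cycle 3
]

def assemble_reads(cycles, lane=1):
    """Walk the stack cycle by cycle at each cluster position to build the reads."""
    reads = {}
    for tile, x, y in cycles[0]:
        name = f"lane{lane}:{tile}:{x}:{y}"  # the coordinates double as the read identifier
        reads[name] = "".join(cycle[(tile, x, y)] for cycle in cycles)
    return reads

print(assemble_reads(cycles))
# {'lane1:1101:5:1': 'TGC', 'lane1:1101:8:3': 'AAT'}
```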
So I already talked about the cluster generation. For paired-end sequencing, after reading the first end we strip off the synthesized strand, regenerate the cluster on the flow cell surface — this is done on the instrument — and then strip off the other strand, which allows a priming event so we can read in from the other end, now reading the other strand. The way we control which strand gets cleaved is very simple: two different modified bases — either chemically modified or a uracil substitution — are incorporated into the oligos grafted onto the flow cell surface, and this lets us control which one we cleave and allows us to do paired-end reads. Okay, any questions on that? If not, we'll talk about base calling. So in terms of the analysis steps, I've already sort of gone through this: the first step is base calling, which is what we're going to talk about now; the second step is reference alignment, which Misha will be talking about; and then application-specific analysis, for example segmentation. On an Illumina HiSeq system we essentially go from terabytes of images — which are now not even stored — to intensity files, and then to the FASTQ file, with an intermediate file that used to be generated, the qseq file, which contains the base calls and the qualities, okay? Some considerations about the sequencing itself: most of this is fairly black-box now, but just for historical sake I'll mention a couple. One is phasing and pre-phasing. When we're sequencing one of these clusters, as we go through the step-by-step addition of nucleotides, the strands within the cluster can come out of phase. Some strands can get ahead of the others — perhaps a nucleotide was incorporated that wasn't fully terminated and we got two incorporations in one cycle — and on other strands the deprotection didn't work and we got no incorporation at that step. So we get either pre-phasing or phasing events, and in fact these events are what really limits the read lengths on these platforms, because eventually the phasing and pre-phasing become so dominant that you can't call the majority base anymore. This still remains one of the technical hurdles for generating longer read lengths on these platforms. It's all fairly black-box now, so unless you go back to some pretty historical data you probably don't need to worry about it. The other thing worth knowing about is the so-called chastity filter. This is the formula for how chastity is calculated: essentially, chastity is the brightest base intensity for that cycle divided by the sum of the brightest and second-brightest intensities, and it has to be greater than or equal to 0.6 — for current-generation sequencers, over the first 25 bases, with one allowed failure. Chastity is a way of determining the purity of a particular cluster: on a random array we may get cases where two clusters are very close to one another, or actually overlapping, in which case our ability to call an accurate sequence for that cluster is obviously impaired, and what this filter is essentially doing is flagging polyclonal clusters. This is true all the way up to the 2500 system, but after the 2500, when we get into ordered arrays, chastity has been disabled, so there are no longer any chastity-failed reads — if you get a FASTQ file from a 4000 you'll see that there are no flagged reads anymore. So what is the output of the HiSeq — what does the file actually look like? This is a file that you probably won't ever see unless you're working on the sequencer itself, but I wanted to show it because it illustrates the intermediate step before we get to the FASTQ file. At the top here you can see a series of pieces of information related to that particular run, and it's actually the concatenation of this information that becomes the sequence name itself, which is how we identify that sequence in all subsequent analysis. It includes the instrument ID, the run ID (which of course isn't unique on its own), the lane ID, the tile — in this case tile 1101 — and then the X and Y coordinates I just talked about; that's how we identify each of the individual reads. Then there's which read it is, and whether chastity passed or failed. The sequence is then provided, as A's, C's, T's, and G's, and then the quality scores are provided. The quality scores are Phred-like, log-transformed probabilities, but instead of encoding them as numeric digits they're encoded in ASCII, and then there's the chastity pass or fail. So that's the qseq file. What is this ASCII encoding? It's essentially a way of compressing the quality scores into single characters: instead of using two digits we use one character, and it matters because we're generating millions of these things. It's just a very simple way of converting, say, a quality of 40 into a single character. And I just want to point out that on earlier-generation Illumina systems — some of the archived data — the Phred quality scores are ASCII 64 encoded, which was the original Illumina encoding. More recently — I can't remember exactly when the transition occurred, it was a few years ago now — we moved from ASCII 64 to ASCII 33, which is the FASTQ standard, so everything as of today uses the ASCII 33 offset. How do you actually interpret that? You need to know what the ASCII encoding is. If it's ASCII 64, and the character is, say, a lowercase 'h', you look it up in the ASCII table, see that 'h' is 104, subtract 64, and that gives you 40 — that's your Phred score. If it's ASCII 33, then you subtract 33 instead of 64. And even if you don't know which encoding it is, you can figure it out pretty easily by looking up a few of the quality characters and seeing whether you end up in a sensible range or not. Okay, so from the qseq file the FASTQ file is generated, and this has really become the universal standard. It was developed at the Sanger Centre as a way of encoding massively parallel sequencing data in a compact form, and this is the format you should be aware of, which is essentially four lines per record. The first line begins with the '@' character and then has the sequence name, and that sequence name, as you'll see when we look at FASTQ files, is a concatenation of the information related to that run. And when we talk about paired-end reads: a paired-end run generates two FASTQ files, one for read one and one for read two.
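Coming back to the ASCII arithmetic for a moment, here is a small sketch of decoding a quality string under the two historical offsets; the function name and the example strings are mine, purely for illustration:

```python
def decode_qualities(qual_string: str, offset: int = 33) -> list[int]:
    """Convert an ASCII-encoded quality string back to Phred scores."""
    return [ord(ch) - offset for ch in qual_string]

print(decode_qualities("IIIHF", offset=33))  # [40, 40, 40, 39, 37]  modern FASTQ, Phred+33
print(decode_qualities("hhhgf", offset=64))  # [40, 40, 40, 39, 38]  older Illumina, Phred+64
```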
The only way we relate read one and read two to each other is through the name: the name is the same, but the read number is different. That's how we associate read one and read two in FASTQ files — through the naming, okay? So a record begins with the unique name, followed by the raw sequence letters, shown here; then there's a plus character, which can optionally be followed by an identifier on that line, but typically isn't; and then we have the ASCII quality scores, and for FASTQ files the ASCII quality scores are encoded with an offset of 33. It's 33 because 32, I think, is a space, which you can't usefully encode — that's why 33 is used; there's no magic to it. Okay, so I've gone through the chemistry, and the output of your ChIP-seq experiment is a FASTQ file. So now you've completed your ChIP-seq run, or you've had someone else run it, and you've got your FASTQ files — how do we assess the quality of the resulting files? Of course we have millions of sequences, so we need some way of assessing quality across all of them in a standardized way, and the tool we'll be talking about today is FastQC, which we'll use in the practicum after the break. It provides a set of standardized metrics for assessing quality, looking at such things as GC content, over-represented sequences within the data set, quality scores, et cetera. As a take-away, what I'm going to show you as we go through the practicum is that biology matters: depending on what mark you're IP'ing, you're going to see different results in your FASTQ file, so you need to understand what it is you're IP'ing to be able to interpret the quality that comes out of FastQC at the back end. We'll go through an example of how to run it and how to interpret the output for some common marks. Okay, so in the last five or ten minutes here I'm just going to give a very high-level overview of the remaining levels, which will of course be the topic of Misha Bilenky's practicum later this afternoon. Once we have the FASTQ files, we align them to a reference genome and generate what has become the de facto standard for alignments, the BAM file, which is a binary representation of the SAM file and essentially records where each fragment aligns to the genome, along with the associated mapping qualities. From there we go to segmentation and transformation, where we try to model the behavior of the mark we're interested in across the genome, and Misha will talk about some techniques and tools for that. I just want to point out that this second step, the segmentation, still remains a very active area of research, and there are many, many tools out there. The tools we're going to talk about today are MACS2 and FindER. FindER is a tool that Misha has developed in-house, which overcomes some limitations of MACS2. MACS2 is probably the tool that those of you who have worked with ChIP-seq data are most familiar with, and it has some issues — especially as we deal with deeper sequencing data — in how it calculates background, which you should be aware of. So, at a very simplistic level, how do we transform aligned reads into peaks, or enriched regions? Very simply, this is how it's done — this figure is actually from back in 2007 or 2008, I think, but the principles really haven't changed. We start with an aligned read and — as I mentioned, many ChIP-seq data sets, even data sets generated today, are single-end — we extend that aligned read computationally by the median length of the fragment distribution that went into the sequencer. Here is a molecular biology trace — this is from something called an Agilent Bioanalyzer, but you can think of it as like a gel — which shows the distribution of sizes for the fragments in that library. We can take that distribution, determine its median length, and then computationally extend each read by that median length. Of course, you can think of many reasons why this might not be as accurate as it could be, because there are many fragments that are smaller or larger, but we can use the median length as an approximation of the length of each fragment. If we're using paired-end reads, on the other hand, we know where both ends are, and we don't have to do this extension — we can just use the boundaries defined by the read pair to tell us the length of the fragment and where it aligns on the genome, which has some advantages, as Misha will discuss. Once we have those aligned reads, we can cluster them and look for read clusters that point towards one another — so here I have reads annotated in blue that are oriented 5' to 3', and reads in red that are oriented 3' to 5'. We look for these clusters of reads, and then, by extending the reads by the median length, we can generate a representation — a so-called peak — of that enrichment. So we have these peaks, and then we have to come up with a way of determining whether a given peak is significant or not, and again Misha will be talking about that in some depth. But essentially: how do you know whether a peak you see in your ChIP-seq experiment is a real peak? This is the crux of the bioinformatics problem. The initial tools used essentially a random distribution: you take that same set of reads, randomly redistribute them across the genome, see what the maximum peak height is under that random distribution, and set that as your threshold — anything above it you consider to be an enriched region. There are other methodologies, and again Misha will talk about them. One thing that's commonly done is to sequence a so-called background library; this is typically the input material that went into the IP, so it should be representative of the fragments that went into the immunoprecipitation — the so-called background, or input, library. You can run MACS2 in a mode where it calculates the enrichment for each bin based on the alignment of your IP library minus the signal from your input — but that has problems too. [Audience question: is the input required?] So for IHEC we recommend it; it's not required, but it is a helpful addition, because it lets you identify regions that show enrichment as a consequence of your library preparation. For example, you might find that there's a region of the genome that shears at a higher frequency than the adjacent regions, and so you're going to have over-represented fragments from that region
before you even start the IP of that particular region so when you do your IP or when you sequence the input you're going to get a peak at that region simply because that region had a higher frequency of shearing or for example M&A's digestion so you know there is definitely utility in sequencing the input but the flip side of that is there are regions if we take just an input library and Mike Snyder and the encode consortium actually published a paper on this you can actually do a pretty good job at predicting open regions of chromatin just based on the input library so when we're looking for example regions say H3K4 trimethylation that is located in open regions of chromatin and we simply subtract the input library we're actually down sampling where the real signal is so it's not black and white unfortunately most folks when we're running max use the input and we'll do the subtraction but I think Misha will talk about this will you Misha? Misha will talk about some of the complications of that but input libraries are recommended and in fact for publications typically you'll need to include an input although how you process that input and how you use that input in your analysis I think is very analysis specific and something you need to keep in mind okay so don't just blindly go in and okay I'm going to subtract all my input realize what you're doing think about the biology behind why you're getting signal in your input and how that might change the or at least modify the result okay it's a good question yeah single end read yeah can be yeah that's right million fragments correct how long do you read so that's a mapping issue so how many unique positions in the genome can you align it a particular read and it's to some degree budget driven I think the data we're going to look at is 36 base pairs but there's no I mean the longer the better to some degree as long as you're not reading all the way through your insert and then you're just burning reads on your adapter so if you're reading you know if your insert sizes are 200 base pairs and you're doing paired in 150s a lot of that you know all of the sequences you're going to be generating are essentially adapter so it's not particularly useful so think about the average size of your insert versus the sequencing length but I can say that if you're designing a chip seek experiment I would highly recommend you do paired end reads for both a moderate or a minor increase in mapability you'll get about 2% or 3% more higher in regions of repetitive sequence but also the ability to model the fragment size the precise boundaries of that fragment as opposed to doing an essentially computational extension based on the median size yeah well I suppose if you read into the adapter and you trimmed your adapter and you knew if you read the entire thing then no you would not but that's typically not done I suppose you could do it yeah you had a question so that's essentially in the generation of the clusters so the the first bases the intensity is lower and then the sequencer or the algorithm is able to accurately distinguish the position of the clusters so yeah you do get a lower you get the first three or four bases will have a lower base quality and we'll see some examples of that in the FASTQ but that's not a chip seek thing that's a sequencing thing we see that across all sequencers and some people trim the first six base pairs especially for example for RNA sequencing when you're using random priming but that's not the topic of this discussion but 
But yeah, it's a good point. Any other questions? If not, I'm going to very briefly, in the last three minutes or so, talk about how we actually assess the quality of the ChIP-seq itself. I know I didn't include this slide, but we can update these slides. There is a series of metrics one can look at, and Misha, I think you're going to be talking a little bit about this as well? No? Okay. So I just want to show you: this is an active area of research, how we tell whether a particular ChIP-seq library is of good quality or not, and I wanted to show you some of the metrics that we calculate, or that can be calculated, for looking at the quality of an IP. Here we're looking at a particular mark, in this case H3K27 acetylation, and this is all available for you to go and look at. Each one of these columns represents an individual library, an individual IP.

The first thing is simply the total number of reads, which is not particularly useful. One can look at the percent mapped; you would expect the percent mapped to be in the range of 80% or so for a good IP, and if you have IPs with a low percent mapped you should be concerned. The threshold used to be 50% or below, but again it's up to you to make that distinction; there isn't a hard cutoff. You can then look at the number of reads in that population that are uniquely aligned to the genome, in other words whether there is a large fraction of repetitive sequence, and for that we use the mapping quality, which Misha will be talking about in the context of the BAM alignment. Mapping quality is different from the Phred score, but it is likewise a log-transformed probability, in this case the probability that the read is aligned incorrectly. If you have paired-end reads, you can look at how many are properly paired. You can also look at the percent of duplicates, that is, how many duplicates you have in your library. Here, when we talk about duplicates, we're talking about PCR duplicates, and PCR duplicates are defined as reads that have exactly the same start and end position on the genome; we typically flag these as PCR, or amplification, artifacts. But if you think about it, if we sequence a library deeply enough, then all the reads are going to be duplicated, right? If we just keep sequencing our library, we're eventually going to sequence every fragment more than once. So the interpretation of the percent duplicates, again, is not black and white; it depends on the biology of the mark. In this case we have a punctate mark found in about 2 or 3 percent of the genome, H3K27 acetylation, so even at low sequencing depths we're going to start to see duplicates arise as a consequence of the distribution of that mark in the genome and how deeply we sequence. So you might ask, well, do you collapse duplicates or not? Again, Misha can talk about this, but it depends on the mark. Typically we do not collapse duplicates for ChIP-seq data, because by doing so we're actually bringing down the signal in regions that are highly marked by that particular mark, okay?
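If you wanted to pull those alignment-level numbers for your own libraries, here is a rough sketch using pysam on a duplicate-marked BAM. The path sample.bam and the MAPQ cutoff are placeholders for illustration, not the exact settings behind the table I'm showing.

```python
# Rough sketch of the per-library alignment metrics discussed above, computed
# with pysam from a coordinate-sorted, duplicate-marked BAM ("sample.bam" is a
# placeholder path; MIN_MAPQ is an illustrative cutoff, not a hard rule).
import pysam

MIN_MAPQ = 10  # reads at or above this MAPQ are counted as "uniquely" aligned

total = mapped = unique = dup = proper = 0
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam:
        if read.is_secondary or read.is_supplementary:
            continue                            # count each read once
        total += 1
        if read.is_unmapped:
            continue
        mapped += 1
        if read.mapping_quality >= MIN_MAPQ:    # MAPQ = -10*log10(P(wrong position))
            unique += 1
        if read.is_duplicate:                   # flagged PCR/optical duplicates
            dup += 1
        if read.is_paired and read.is_proper_pair:
            proper += 1

def pct(x, denom):
    return 100 * x / denom if denom else 0.0

print(f"total reads             : {total}")
print(f"% mapped                : {pct(mapped, total):.1f}")
print(f"% MAPQ >= {MIN_MAPQ}            : {pct(unique, total):.1f}")
print(f"% duplicates (of mapped): {pct(dup, mapped):.1f}")
print(f"% properly paired       : {pct(proper, total):.1f}")
```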
And then finally we can look at, and this is a metric we've been developing in our own centre, how many of those reads align to regions where we expect that mark to be. For example, for H3K27 acetylation we take all the Ensembl gene promoters, take a window around the transcription start site, and ask, looking at all the reads, what percentage of them align back to that predicted domain, or at least to one component of where we would predict H3K27 acetylation to arise. You can see that typically you get a fairly good representation, and that representation is fairly stable across a whole set of libraries; again, each one of these columns represents a single library, but some libraries have a higher fraction of reads associated with promoter regions and some a lower fraction. In the interest of time I won't go through all of them, but you can look through each of the marks we profiled here and see that different marks have different performances. For example, for H3K36 trimethylation, where we look within gene bodies and ask what fraction of reads align back to gene bodies in our IPs, we get a much more stable performance over time than we do for, say, H3K27 acetylation. So how do you measure the quality of an IP? You can look at the mapping qualities, the mapping characteristics, and you can look at where you think the reads should align in the genome and what fraction of reads fall within those regions. Okay, so with that I will end it there, and I guess it's coffee break time.

Question: when it comes to the issue of duplicate reads, one way potentially to get around that is to include a random barcode with your PCR primers, so that if you have duplicate reads aligning to the same position on the genome but carrying different random indices, those would define true distinct reads, would they not? Right, and that technology exists, this idea of adding randomers. We talked about indexes; you can also add a randomer during library construction, and that's typically done for counting exercises such as RNA-seq and microRNA sequencing, where there's a random priming event that allows you to bring that oligo in. For ChIP-seq libraries we're not doing random priming, at least I'm not aware of technologies that do it that way, so you would have to ligate that randomer onto the end of the adapter, which you could do, but you'd have to modify the sequences to allow it. That would be a way of distinguishing true duplicates from PCR duplicates: true duplicates would have different randomers at the same position, whereas PCR duplicates would have the same randomer and the same position. But typically we do not collapse duplicates for ChIP-seq reads, and I think the reasons for that are hopefully fairly apparent. Misha, are you collapsing them? RNA-seq we don't... sorry, ChIP-seq we don't; RNA-seq we do. Okay, any other questions? If not, we'll go to coffee break. Thanks, guys.
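To make that last exchange about randomers concrete, here is a minimal sketch with made-up reads showing how a random barcode (UMI) would separate the two cases: reads that share an alignment position but carry different UMIs are kept as distinct fragments, while reads sharing both position and UMI collapse to a single fragment as presumed PCR duplicates.

```python
# Minimal sketch: group hypothetical reads by (chrom, start, strand, UMI).
# Same position + same UMI  -> PCR duplicate (collapsed to one representative).
# Same position + different UMI -> true biological duplicate (kept).
from collections import defaultdict

reads = [
    ("chr1", 10_000, "+", "ACGT"),
    ("chr1", 10_000, "+", "ACGT"),  # same position, same UMI: PCR duplicate
    ("chr1", 10_000, "+", "TTGA"),  # same position, different UMI: true duplicate
    ("chr1", 25_000, "-", "GGCA"),
]

groups = defaultdict(int)
for chrom, start, strand, umi in reads:
    groups[(chrom, start, strand, umi)] += 1

kept = list(groups)                              # one representative per group
pcr_dups = sum(n - 1 for n in groups.values())   # extra copies within a group
print(f"{len(reads)} reads -> {len(kept)} kept, {pcr_dups} collapsed as PCR duplicates")
```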