So, I introduced myself fairly briefly, but just to give you an understanding of the kinds of things we do in the lab. I heard there's quite a wide variety of backgrounds — we've got professors all the way to undergrads in the room today — so it's always hard to know how to start. What I'm going to be doing is giving an overview of epigenetics to begin with, and then I'm going to step us into how the data is generated for the majority of epigenomic analyses. I'll be talking about how the sequencing works and what the output file is, which is a so-called FASTQ file. How many people in this room are aware of what a FASTQ file is? So half or so. So for those of you who know epigenetics already, and there's quite a few, and those of you who know what a FASTQ file is, you can perhaps have a little bit of a snooze while I go through this, but I'm basically just going to give us an introduction and get us to the point of a FASTQ file. And then in module one we'll do the first step of any good epigenomic analysis, which is looking at the quality of that output file. And again, as Anne introduced, the point of doing this all together is to provide opportunities to discuss, so I do encourage you to stop and question, and we can have that discussion. I don't want to stand up here and make this didactic, right? Having those conversations is, I think, key to making this successful for you. Obviously this is a huge field, and it's impossible to cover the breadth of questions that you're going to have in the context of an hour or even two days. My lab focuses primarily on malignancies: we study types of cancers that are driven by epigenetic dysfunction, as we understand it from the perspective of genetic lesions. So we study cancers that have mutations in epigenetic modifiers, and I heard some of the folks here in the group are doing similar things. For example, synovial sarcoma is a type of cancer we study. We study leukemias, looking primarily at the IDH-TET2 pathway and the role it might play in initiation and progression of the tumor. So that's the kind of thing my lab does. I come at it from a technology perspective, so we've developed technologies for doing ChIP-seq, RNA-seq, all these various things, working along with Misha over the last — my gosh — decade, I guess, now. And more recently, we've been working on a next-generation tool for ChIP-seq peak calling, to get around some of the issues that MACS2 has. We're going to get into that to some degree. But again, I encourage you to ask questions and probe to understand some of the complexities of the field. All right, so in terms of what I'm going to be talking about today: I'm very briefly going to go through some epigenomic principles — what is epigenomics, why do we study it — and I'm going to describe the basic molecular biology driving massively parallel sequencing. This is the technology that really launched the field. I would argue that the reason most of you are here is because of the development of massively parallel sequencing; if it wasn't for that, we wouldn't be looking at the epigenome. And in fact, that's how I got into the field, because I was working on the technology side, on what at the time was the Solexa platform, and on ChIP, or chromatin immunoprecipitation, with a guy by the name of Tony Kouzarides. Some of you might know him from Cambridge.
We ran some of the first ChIP-seq libraries back in 2007, because at the time that's what the sequencing platform could do. We could generate about 10 million reads, really short 27-mers. And for the first time, it was like magic: you could actually look at K4 trimethylation genome-wide. And really, I've been working from that perspective ever since. So that's how I got into it, but I'll be talking about that, because that, of course, is the fundamental measuring technology that not all, but certainly a large fraction of, epigenomic measurements depend upon. The output of that is the FASTQ file, which becomes the input for all of the downstream analysis, and Misha will carry on with the description of what we do with the FASTQ file. I'm going to talk about some underlying principles and challenges of ChIP sequencing — again, looking more from the wet-lab side of things, but then getting into some of the dry-lab measurements that we use — and then give a very high-level overview of the analysis workflow that will feed into module two. Okay, so a very brief history of epigenomics. Again, some of this is probably review for a lot of you. But let's remember that the terms heterochromatin and euchromatin, which hopefully most of you are familiar with, were actually coined almost a century ago, right? And they were coined in the context of understanding why some regions of the chromosome — in this particular case, this is actually looking at mosses — took up more dye than others. So these densely staining regions are the so-called heterochromatin, and the regions of the chromatin that were less dense, or took up less of the dye, are the so-called euchromatin regions. And of course we know now that the euchromatin regions are where the genes are present and where the majority of transcription occurs. So a few years afterwards — in fact, this is only a year afterwards — these are some of the first studies looking at position-effect variegation. And I wanted to point this out because this is really the first discovery of the fact that heterochromatin is not a static unit, right? The heterochromatin isn't simply there, unchanging; it's actually a dynamic unit. And this was shown through these translocation experiments, where when this gene was placed near the boundary of a heterochromatic element, it would be silenced in a non-Mendelian way, suggesting that the heterochromatin itself is dynamic. So of course, hopefully most of you are familiar with Conrad Waddington's work. Again, this is a few years later, but this of course is before we knew that DNA was the hereditary unit. Waddington coined this term in 1942, and he coined it in the context of understanding how genotype gives rise to phenotype. In this particular case he was studying fly wing development, asking how the same genotype gives rise to different phenotypes. And he coined this term the epigenotype, which of course was then developed further into the epigenome. So shortly thereafter, the famous experiments by Avery, MacLeod, and McCarty actually showed that DNA was the hereditary element — it was in fact through DNA that phenotype could be transferred. So now we understand that there's DNA.
We understand that there are these heterochromatin and euchromatin compartments, but we still have no real understanding of the epigenetic mechanisms. So just a few years later, looking at thin-layer chromatography (TLC), it was observed that, of course, the four main bases are there — guanine, cytosine, adenine, and thymine — but there was this other component, making up a few percent of the DNA, which was named epicytosine. Okay, so it ran near cytosine in the TLC, and it was called epicytosine. And this is the actual publication from 1948, and you can see that they conjectured — they hypothesized — that this epicytosine was actually 5-methylcytosine. And this was through studies of bacterial DNA, looking in particular at TB, which showed that they had a similar size. So now we have an understanding that there are the four main bases, we know that these bases are the hereditary material, and we actually understand that some of these bases can be chemically modified — and we think that this chemical modification is DNA methylation. So moving a little bit further in time, now we're up to 1951. Of course, Barbara McClintock famously discovered these discrete elements of the genome, so-called transposable elements, that have regulatory potential. So these are regulatory elements that we, of course, study quite extensively now using epigenomic tools such as ChIP-seq, looking at marks such as K27 acetylation. But these discrete elements could jump around the genome, and wherever they landed, they could actually regulate the gene that was nearby. So this is the concept that there are discrete regulatory elements that we can move around the genome, that we can measure, and that can then regulate nearby genes. A few years later, the discovery of X-inactivation — hopefully everyone in this room is familiar with dosage compensation in females, the fact that females have two X chromosomes and one of them must be silenced. Mary Lyon did the fundamental work, and in fact originally posited that this X-chromosome inactivation occurred, and was mediated, through epigenetic mechanisms. Okay, so now we know that there's DNA methylation, and that DNA methylation can control gene expression. But what is it that DNA methylation is doing in the genome, and what is its role in normal differentiation? So there are a number of papers — now we're up to 1975, 1980 — really hypothesizing what the role of DNA methylation might be. So: DNA modification affects gene activity during development. And this is where the first concepts really emerged suggesting that DNA methylation was playing a role in controlling differentiation — really controlling how genes are transcribed as we differentiate from a totipotent zygote all the way through to a terminally differentiated cell. Around this time, pioneering work from Peter Jones and Stephen Baylin and a number of others showed that a cytotoxic drug originally developed in the 50s as a chemotherapeutic — the highly toxic, so-called 5-azacytidine — could, at limited doses, if you added it to cells in culture at a very dilute concentration, actually drive those cells to differentiate. So this was, of course, very surprising.
And again, this supported the concept that somehow epigenetic mechanisms were playing a role in mediating differentiation. Okay, so that's the overview of DNA methylation as it relates to the concept of differentiation. Now I'm going to bring us all the way up to date and give you an overview of what we understand about DNA methylation, as a sort of primer for the rest of the course. This is a review in Nature by Dirk Schübeler, for those who are interested. So the first thing I want to point out is that CpGs themselves are unevenly distributed in mammalian genomes. And everybody, I'm assuming, knows this. Do people know why this is? Why are CpGs unevenly distributed? I'm sure some of you know. Sarah has an answer in the back. Right — they're enriched in promoters. But why? From an evolutionary perspective, it's through this mechanism of deamination, right? So these so-called CpG islands, which I'm sure many of you are familiar with, are actually just remnants of the ancestral genome. It's not that CpGs collected at promoters; it's that the rest of the CpGs in the genome have slowly degraded away. And they've degraded through spontaneous deamination: when 5-methylcytosine is deaminated, it becomes thymine, which is not recognized as a base error by the base excision repair machinery. And if it's not under selective pressure, it's essentially lost. So the concept is that these CpG islands — these CpG-dense regions — are remnants of the ancestral genome that are under selective pressure. And we think they're under selective pressure because they're actually playing a role, as Sarah pointed out, in regulating the genes. So that's one concept to understand. The majority of CpGs in mammalian genomes — I work in mammalian genomes — are methylated. This is not true of all organisms, but it's certainly true of mammals. And there's an overall decrease correlated with differentiation in particular cellular contexts. This has been studied quite extensively in hematopoietic differentiation, an area that I work in, and we know that, for example, in B-cell differentiation, we lose methylation globally. This is also a characteristic of disease states like malignant genomes: we know that we lose methylation genome-wide. However, paradoxically, we also gain methylation within CpG promoters, and I know Piyom will be talking about that a little bit more. Dogmatically, if you look in textbooks, we think about DNA methylation as being a repressive mark. And when we do bioinformatic analysis on DNA methylation, you can read numerous papers saying: OK, I found a CpG that's methylated in the blood of these 100 infants, and it's near this gene X; therefore this gene must be repressed by the presence of this methylation. But it's important to note that DNA methylation is context-specific. Gene bodies themselves are methylated. So if methylation is found within a gene body, that's probably more likely representative of an actively transcribed gene than a repressed gene. So you cannot simply say that because you found a methylated CpG near a gene, that gene is repressed. That's an important concept to understand. And the mechanisms by which this occurs are fairly well known — it actually goes through an intermediate step of histone methylation, H3K36 trimethylation, which I'll talk about in a minute.
Most CpG islands — these ancestral remnants of the CpG density of the genome — are unmethylated. The vast majority of them are unmethylated in normal genomes. When they do become methylated, however, CpG islands are repressed. Now, this I think is a good figure to get into in a little more detail. So again, when we look at the mammalian genome shown here — a vertebrate genome — the majority of the genome is methylated, and of course actively transcribed gene bodies are methylated. CpG islands tend to be unmethylated, even in front of an inactive gene — so these are unmethylated. And importantly, regulatory elements, and in particular enhancer elements — something that I think many of you are probably interested in studying — are unmethylated or hypomethylated. And this is actually very powerful, something we use in the lab a lot to try to understand regulatory states just from DNA methylation: you can look at DNA methylation alone and predict active regulatory states from the methylation state. OK. Another important thing — that's an overview of methylation, but I just wanted to point out, and hopefully everyone is aware, that DNA methylation itself has been selectively lost in a number of species, including a number of model organisms. So DNA methylation itself is not required for life. It's not required for multicellular differentiation. We can lose it, and things seem to be fine. In fact, even in a mouse embryonic stem cell, we can knock out all methylation, and the cell can proliferate and grow perfectly well until such time as we induce it to differentiate. As soon as we induce it to differentiate, the cell dies. So methylation is not required for cellular maintenance, and it has been lost in a number of organisms, including such things as C. elegans — but in the context of a mammalian cell, if we knock it out, the cell dies upon differentiation. So that's DNA methylation in a very, very high-level overview. Of course, another major component of epigenomic regulation is histone modification. And this is the crystal structure of the nucleosome, in which you can see the histones, which are colored here. And importantly, what you can see are these long N-terminal tails sticking out of the nucleosome. We now know that it's these N-terminal tails, for the most part, that can be decorated with chemical modifications. In fact, we know that there are more than 100 different chemical modifications — or at least 100 different chemical modifications have been observed by mass spec. We only study a very small subset of those using ChIP sequencing, and I hope I will show you why we think that's appropriate in a few slides. So now going back to a little bit of the history: how did we find out that these chemical modifications exist, and how do we know that they're linked to changes in gene expression? Going back to 1964, the first studies associating acetylation — one type of chemical modification that can occur — with active gene transcription were made. Of course, back in '64, and even in '78, we thought that these chemical modifications were more structural. We didn't have the concept of them being actively regulated or dynamic in the sense that we understand them to be today. I don't know why I showed this — OK, this is slightly out of order; this should have been a slide earlier.
I don't know, these got a little bit out of order. But just to remind everyone: DNA methylation and epigenetic mechanisms in general go through reprogramming twice in the organism's life — once during primordial germ cell development, and then again just after fertilization. And of course, this has become somewhat controversial: in the context of this epigenomic reprogramming, how can we get such things as transgenerational inheritance? I know many of you in the room study this very extensively, so I'm not going to embarrass myself by talking any more about it. But just to get back to the histone code: David Allis famously showed — and this is in an organism, Tetrahymena, which makes beautiful pictures — or was the first to show, the mechanism by which a histone acetyltransferase, recruited by Pol II, acetylated histones, and how this acted to reinforce active transcriptional states. So this is really our first understanding of how acetylation was linked to active transcription. But of course, this was more in the context of consequence versus cause: acetylation was occurring, but it was occurring as a consequence of transcription rather than encoding information itself. OK, so moving a few years forward now to bring us up to date, here is a series of histone modifications that are actively studied in the lab, and these will be the ones we talk about today. We can broadly group these histone modifications into those that are activating, such as acetylation — and this is where it was first studied — and those that are associated with repressive or heterochromatic states. So just for those of you who are unfamiliar with the nomenclature, this is how we read these names. The first characters are the histone — so this is histone H3. Then we have lysine 4 — remember that it's the N-terminal tail, so we're counting in; this is the fourth lysine from the N-terminal end. And then we have the modification itself, in this particular case trimethylation. That's review for most of you. Trimethylation; here's monomethylation. Here's another mark, K27 — the 27th lysine from the N-terminal end — and this one is acetylated. And what I want you to notice, of course, is that depending not only on the type of modification but on the amount of modification, you can encode different information. So the way we understand it is that H3K4 monomethylation is a mark of enhancer states. And you might have heard the term primed enhancers — terminology that emerged primarily out of studying hematopoietic differentiation, but it seems to apply to other tissues as well. And we have other marks, such as H3K27 acetylation, which marks active enhancers. I already talked briefly about H3K36 trimethylation, which is associated with actively elongating transcription. And K4 trimethylation, which marks active promoters. We also have repressive chromatin, and here we have the same position on the same histone, but it is trimethylated as opposed to acetylated — and that changes the state from an active state to a repressed state. And of course, we also know that these marks don't act on their own; they act together in the so-called histone code — terminology that David Allis again coined.
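[Since this nomenclature comes up constantly in the data and the literature, here is a toy sketch of how to unpack it programmatically. The mark names are the real ones discussed above; the parser itself, and the small set of modifications it knows about, are just an illustration.]

```python
import re

# Toy parser for the histone-mark nomenclature described above:
# histone + residue + position + modification,
# e.g. H3K4me3 = trimethylation of lysine 4 on histone H3.
MARK_RE = re.compile(r"^(H[A-Z0-9.]*?)([KRST])(\d+)(me[123]|ac|ph|ub)$")

AMINO = {"K": "lysine", "R": "arginine", "S": "serine", "T": "threonine"}
MOD = {"me1": "monomethylation", "me2": "dimethylation",
       "me3": "trimethylation", "ac": "acetylation",
       "ph": "phosphorylation", "ub": "ubiquitylation"}

def describe(mark: str) -> str:
    m = MARK_RE.match(mark)
    if not m:
        raise ValueError(f"unrecognized mark: {mark}")
    histone, residue, pos, mod = m.groups()
    return f"{MOD[mod]} of {AMINO[residue]} {pos} on histone {histone}"

# The six reference marks discussed in this lecture:
for mark in ["H3K4me1", "H3K4me3", "H3K27ac", "H3K36me3", "H3K27me3", "H3K9me3"]:
    print(mark, "=", describe(mark))
```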
And we know that we can get cases where we have both active and repressive states coexisting together at promoters — the so-called bivalent promoters that hopefully some of you are familiar with. This was work initially by Brad Bernstein, looking at mouse embryonic stem cells back in 2006, which showed that there is a whole set of promoters marked with both an active and a repressive state. In that particular case, those genes are transcriptionally inert in the ES cell, and then there's a so-called resolution of the bivalent state as the cell differentiates. And in fact, these bivalent promoters turn out to be very important not only for normal differentiation; they also tend to be the promoters that are deregulated as a consequence of disease, in particular in malignancies. OK, so why do we study these? I told you that there are 100 different modifications, and I'm showing you six here. These six are an important subset that was initially selected as part of the NIH Reference Epigenome program — some of you might be familiar with that. It's something I was involved in for a number of years, and it was a founding member of the International Human Epigenome Consortium. Back in 2008, when the NIH Roadmap program started, I remember being at the meeting in DC, and the question was: what is a reference epigenome? How do you define a reference epigenome? At that time ENCODE was just starting up, and Manolis Kellis, Jason Ernst, and a number of others were trying to ask this question: if there are hundreds of different modifications, how can we even begin to study them? And so the concept of defining this set of reference modifications — which is now the set you will come across most often in the literature, and in fact most data sets contain one or more of these particular modifications — really comes from the observation that if we look across many different modifications, and here I'm showing you a whole set of different acetylations and methylations, there is in fact a lot of redundancy in the histone code. So that set of six modifications was selected as the one that best represented the diversity of chromatin states that could be obtained by profiling as many of these modifications as possible. This work was done by Bing Ren's group in ES cells, really asking: if we look at all of these different modifications, what more can we learn? And what we learned was that many of these modifications are in fact highly redundant — if we know one of them, we can be pretty certain what the other ones might be. But that leaves open the question: why are all these modifications there? What is the cell doing with them? Those are still open questions, and the field is not clear on what this redundancy is doing. It's probably likely that some of these other marks are specific for states that we perhaps don't understand as of yet. But that's how this set of six marks was selected, and it has really become the reference set of histone marks used in the field currently. So just to end the introduction, maybe I'll make some comments on the major questions in the field. And I think we could probably argue about this amongst ourselves — maybe we can have a discussion at the coffee break.
But really, cause versus consequence of epigenetic modification is something that comes up a lot. If a modification is there, is it actually encoding information, or is it just a consequence of some other regulatory state? The easiest one to think about is H3K36 trimethylation. We know it's being laid down as the polymerase traverses the gene. So is that a cause or a consequence, right? Sounds more like a consequence than a cause. And one can make the same argument across many histone marks. So I think that's one major question in the field. And then secondly, the scope and mechanisms of transgenerational inheritance. I showed you that DNA methylation is reset during development, so that opens the question: if that's true, then how can we have transgenerational inheritance of epigenetic modifications? Again, I'd be happy to discuss it over coffee. Okay, so that's the overview of epigenomics, and now I'm going to start getting into the bioinformatics of things. First of all, as I mentioned, it's really massively parallel sequencing that underpins the vast majority of epigenomic assays. Hopefully many of you are familiar with the four classes shown here. For DNA methylation analysis — and many of you, when we went around the room, talked about sodium bisulfite sequencing — that's of course one way of measuring DNA methylation. We can also use immunoprecipitation strategies, so-called MeDIP sequencing, or the hydroxymethylated version, hMeDIP sequencing. So those are ways we can measure methylation genome-wide. Of course, ChIP-seq is another methodology, and I'm going to go into this in a little more detail. Open chromatin regions — regions of the genome that are open and available for transcription factor or regulatory binding — we can measure using simple digestion strategies and, again, building libraries, sequencing them, and aligning them to the genome. And more recently, three-dimensional strategies. We know that the genome is not a linear molecule in the nucleus; it's compacted and packaged in a very tightly regulated way, and this three-dimensional structure itself can be informative in many different ways. I don't think we're going to get into that in the next couple of days, but it's another modality that can be looked at using NGS platforms. Okay, so how do we do ChIP sequencing? This is a very basic overview; again, hopefully this is review for most of you. We have the genome, where we have a bunch of histones, and these histones can be modified in some way. The first step in ChIP sequencing is to shear the DNA, and there are two main ways to do this. One is an enzymatic process, usually MNase, although other enzymes are used — so, MNase digesting the genome. The other is sonication — using cavitation to physically shear the DNA into a series of fragments. More recently, technologies have been developed using transposases, which actually insert between the histones — so-called tagmentation, and flavors of that have also been developed. But fundamentally what we're trying to do is break the genome into chunks, and these chunks, of course, are the size of DNA that we can sequence on a massively parallel sequencing platform.
So the reason we're doing this is just so that we have a fragment we can build a library from and sequence. So we break the genome up, and we then use antibodies that target a particular modification — in this cartoon example, an antibody against H3K4 trimethylation. We immunoprecipitate that material, and then we strip all the protein away, so we're just left with our double-stranded DNA. Depending on the type of shearing you used, you may need to end-repair — to clean up ragged ends on the double-stranded DNA. Then you add adapters, sequence it, and align it to the genome, as we're going to get into. So what are some key considerations for doing epigenomic assays? Well, one of the first is antibody specificity and sensitivity. Do not overlook this. If you do this experiment, and if you're analyzing data, this is something that's worth spending time on, right? What antibody was used, and what evidence do you have that that antibody actually recognizes the epitope you're looking at and doesn't recognize others? This remains a significant challenge in the field. I can say for our own group here in Vancouver, we spend more money and time validating antibodies than we do purchasing them. Just to give you an idea, there are a number of commercially available antibodies out there that do not pass a standard validation of specificity and sensitivity. So if you're going to start an experimental design using ChIP sequencing, please spend the time to understand the specificity of the antibody. I'm going to go through that a little bit more. Another question to ask is which marks you'd like to profile. Well, that depends on the question you're asking, of course. If you're looking at regulatory states, H3K27 acetylation is a very popular choice. I can say that if I had a limited budget and couldn't do all six, the set that people typically do would be H3K4 trimethylation to mark promoters, Polycomb H3K27 trimethylation for the reasons I outlined, and then H3K27 acetylation to mark enhancer states. That's a common set, a nice core to begin with, and you can add others on top. And then a question I get all the time — and I'm sure you've heard this as well — is: how much should I sequence? How deeply should I sequence? Again, that depends on the mark you're profiling. What I'm showing you here are recommendations that have come out of the assay standards working group of the International Human Epigenome Consortium, which suggests 50 million reads — and remember, as I'll get into in a little more detail, when we sequence what we immunoprecipitate we can sequence from either end, as read pairs, so that's 25 million read pairs, or 25 million fragments. For broader marks, such as H3K9 trimethylation, which primarily marks heterochromatic regions and can occupy a vast expanse of the genome, we suggest more: 100 million reads. That's probably significantly more than some of you have seen in the literature, but this is based on sub-sampling and really trying to strike a good balance between budget and sensitivity of the assay.
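[The unit bookkeeping here — reads versus read pairs versus fragments — trips people up constantly, so here it is spelled out as arithmetic. Treat the targets as the rules of thumb quoted above, not as official numbers.]

```python
# One sequenced fragment read from both ends = one read pair = two reads.
TARGET_READS = {"narrow mark (e.g. H3K4me3)": 50_000_000,
                "broad mark (e.g. H3K9me3)": 100_000_000}

for mark, reads in TARGET_READS.items():
    pairs = reads // 2      # read pairs
    fragments = pairs       # one immunoprecipitated fragment per pair
    print(f"{mark}: {reads:,} reads = {pairs:,} read pairs "
          f"= {fragments:,} fragments")
```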
And some would argue that perhaps even more than 100 million reads would be required to appropriately profile K9 trimethylation. Okay, so: garbage in, garbage out. If you've got a garbage antibody and you do an immunoprecipitation and look at the data, the data is not going to be useful. So do take the time to look at the specificity of the antibody. This is data that we published through the CEEHRC — the Canadian Epigenetics, Environment and Health Research Consortium — mapping center that I run here in Vancouver. This is the QC that we run for all the antibodies. We use a 384-peptide array to calculate the specificity for the target peptide over the other peptides on the array. We do a Western blot, and if it's a new catalog number, we will then do a ChIP-seq experiment — actually sequence it and compare it back to existing data. We do that for every new catalog and lot number that we get from a commercial entity. And even if you have the same catalog number, a different lot number could be a completely new antibody, right? The catalog number has no bearing on what the antibody itself is. This is just another example — this is H3K27 acetylation; acetylation antibodies tend to be the most difficult to validate, and the most challenging to obtain a high-quality antibody for. Again, I'd be happy to discuss the details of this over the break. Yeah — you're not allowed to ask? So: how stable are they? Do you need to retest them, or do they stay good forever? So when we bring them in, we aliquot them and snap-freeze them, or store them at four degrees — so we essentially have working stocks. How stable? Stable for at least a year. We will redo the validation, at least the ChIP-seq, once a year if we have sufficient stock — but it's a good point: you will see some degradation in sensitivity over time. Specificity is fairly stable, as one might expect. Okay. So now we're going to get into the bioinformatics side of things. This is a kind of cartoon of how one might think about ChIP-seq processing, and there are two modules here. Module one, which is where we'll start, looks at the base calls and the quality of the data that has emerged from the sequencing platform — you're going to run FastQC to do exactly that on ChIP-seq data in module one. Module two then takes that FASTQ data and generates alignments to the reference genome, and the standard output of that is the BAM file. The BAM file is essentially those reads aligned to the reference genome; it gives you a series of probabilities about the quality of each alignment, as Misha will discuss in detail, and where that alignment is on the genome. The output from the BAM file is the WIG file — a compressed version of that, which is usually the file type you then go on to do your analysis with: run your R packages on, or associate with genes. So these are the main formats: FASTQ is the raw file, BAM is the alignments, and WIG is a file format that encodes a compressed version of those alignments on the genome. And then of course there are numerous ways to take that data, integrate it, visualize it, and do analysis. Okay, so now I'm going to go through the basics of the sequencing.
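[Before getting into how the reads themselves are generated, here is a minimal sketch of that FASTQ → BAM → coverage-track hand-off, assuming standard tools (FastQC, BWA, samtools, deepTools) are installed. File names are hypothetical, and the course pipeline may differ in its details.]

```python
import os
import subprocess

R1, R2 = "H3K4me3_R1.fastq.gz", "H3K4me3_R2.fastq.gz"  # hypothetical paired-end FASTQs
REF = "hg38.fa"  # reference; a BWA index is assumed to exist (bwa index hg38.fa)

def run(cmd, **kw):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True, **kw)

# Module 1: per-base quality report on the raw FASTQ files.
os.makedirs("qc", exist_ok=True)
run(["fastqc", R1, R2, "--outdir", "qc"])

# Module 2: align to the reference. The BAM stores, for each read, its
# genomic position plus a mapping-quality score (a probability).
with open("H3K4me3.sam", "w") as sam:
    run(["bwa", "mem", REF, R1, R2], stdout=sam)
run(["samtools", "sort", "-o", "H3K4me3.bam", "H3K4me3.sam"])
run(["samtools", "index", "H3K4me3.bam"])

# Compress the alignments into a coverage track for downstream analysis
# (bigWig here rather than WIG, but the idea is the same).
run(["bamCoverage", "-b", "H3K4me3.bam", "-o", "H3K4me3.bw"])
```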
Hopefully this is review for some of you, but this is essentially how the sequence data is generated from these fragments. We start again from our DNA, and we shear it into random fragments. Of course, NGS doesn't really care — this could be anything; in this case, we're doing ChIP sequencing. We end-repair and add DNA of known sequence — these red and green tags — on the ends, and then we PCR-amplify to generate a library that's ready for sequencing. So now it's got this yellow and orange tag that allows us to generate clusters on the sequencing platform. So how does the actual sequencing work? This is a cartoon from a number of years ago, but the basic principles haven't changed. Yeah? So is it possible, if I shear the DNA in different ways and sequence it — say for K4 trimethylation or K27 trimethylation, as an example — can you actually compare them, cluster them together? Yeah, so that's a great question. What we've seen is that you'll get clustering based on the technology. We actually see sub-clusters emerge that are MNase. So if we take a set of data — say we have 50 ChIP-seq experiments, and some subset is cross-linked and sheared and some is MNase — you will get patterns, or sub-clusters, based on the technology. That said, I had a co-op student who worked six months trying to define standards for what those are; it isn't predictable in any way, but it is certainly something to be aware of. There are many regions where you're going to get readouts that are the same — the overall correlation is about 0.9 to 0.92 between cross-linked ChIP-seq and MNase ChIP-seq. What you see is that there are regions that MNase reaches into uniquely, and those are the ones that will tend to cluster together, and there are regions that cross-linking reaches into uniquely. So you get unique regions, the majority is shared — but when you're doing clustering and looking for a one or two percent difference between samples, it's actually going to drive a difference, and you certainly need to be aware of that. Yeah — is that because of the conformation of the DNA? Yeah, I mean, I could hypothesize — it's certainly reaching into different areas. Why that might be, I don't know, but it probably has something to do with the density or availability of that region to be either sheared or digested with MNase. But it is important, so keep it in mind. Does that mean the fragments always start at the same site, because of how it was digested? No, no, they don't. At least the way we think about it is that you've got a nucleosome, a nucleosome, a nucleosome, and you've got DNA in between. So it's chewing in between, but it isn't just in one spot — it can chew all the way in from either side. And in fact we published on this — not to spend too much time on it, but what we found is that when we do MNase digestion, then immunoprecipitate and align those fragments to the genome, we actually see different lengths depending on the target. So for example, for K27 trimethylation, even though the size distribution of the input looks like this — say 200 base pairs — after we do the IP, we actually see a shift in size.
Even though the input is all the same size, we're actually enriching for di-nucleosome fragments, probably because the affinity of the antibody is such that we're pulling those down. So we see this shift, whereas if we look at H3K4 trimethylation, which marks active promoters, we see the majority of it at the input size. So there are a lot of nuances on the experimental side of things, and as much as you can learn about that when you're doing your analysis is important. Okay, so great questions. Moving on: how do we actually do the sequencing? Again, we start with these DNA fragments, which could be MNase-digested, could be cross-linked and sheared, or, more recently, could be transposase-tagged — the transposase adds the purple and blue segments on the ends as part of the transposition process. We then flow those over a flow cell. This was animated at one point — because it's a PDF, it's no longer animated — but essentially, on the flow cell surface we have oligos grafted to the glass that hybridize to either the blue or the purple, okay? So when we flow these DNA fragments over the surface of the flow cell, we get hybridization to either the purple or the blue oligo. So maybe I'll just draw it. Can I erase this? Is it okay to erase this? Okay: flow cell surface. I'm just going to use one color for the sake of argument, but essentially we flow our DNA fragments over — this is three prime, this is five prime here — and we get this hybridization event, which then forms a priming event, okay? That's what's important. A priming event occurs, and then we're able to extend that one strand. And there's a complementary oligo for the other end — the five prime end — which then folds over; you can kind of see that in this compressed version here, this bending-over event that occurs. This is the so-called bridge amplification: you get this bridge, this event here where you've got hybridization, right? That forms the first strand; we then denature it, and repeat that process over and over and over again to generate clonal copies of that one DNA molecule at one location on the flow cell. So that's how Illumina sequencing works. The other primary way of doing this is on beads — the Life Technologies platform, or 454, for those of you who remember that platform — where instead of doing it on a solid surface, we do it on a bead. The concept is exactly the same: the bead has been decorated with oligos, and to keep these individual amplification reactions separate, instead of partitioning on a flow cell surface, we partition in an oil-water mixture to form micro-reactors, right? And the challenge there, of course, is that we have to rely on Poisson loading to get one bead and one strand of DNA together inside some subset of the micro-reactors, which means we have to use a lot more volume. That's one of the reasons why these types of technologies have not really caught on to the degree that, for example, Illumina has with the solid-state surface.
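[To put numbers on that Poisson-loading problem, here is a quick worked example with made-up loading concentrations: the fraction of micro-reactors with exactly k templates follows a Poisson distribution, so keeping polyclonal reactors rare means keeping most reactors empty — hence the wasted volume.]

```python
from math import exp, factorial

def poisson(k: int, lam: float) -> float:
    """P(k templates in a reactor) for mean loading lam."""
    return lam ** k * exp(-lam) / factorial(k)

for lam in (0.1, 0.5, 1.0):
    empty = poisson(0, lam)
    single = poisson(1, lam)
    multi = 1 - empty - single
    print(f"mean templates/reactor = {lam}: "
          f"{empty:.1%} empty, {single:.1%} usable (exactly one), "
          f"{multi:.1%} polyclonal (two or more)")
```

[At a mean loading of 0.1 templates per reactor, only about 9% of reactors are usable and about 90% are empty, which is the volume penalty described above.]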
Now, the original Illumina platforms, up until just a few years ago, actually used random placement of these oligos on the glass surface, so you've got this random pattern of so-called clusters on the flow cell. More recently — and some of you might be aware of this — the HiSeq 4000 and the HiSeq X platforms use an ordered array, so the oligos are placed in an ordered pattern on the flow cell surface. This has led to some problems bioinformatically, where we're getting read-through from adjacent clusters on the flow cell surface — something we call optical duplicates — and you will see this as one of the outputs of the QC of the run. If you've got a high number of optical duplicates, you've got a problem with your run, and I'll go through how that happens in just a minute. Okay, so again, this was animated, but we can walk through it. Once you've generated the clusters, the sequencing itself works like this. We start by adding a primer — in this case the primer is five prime to three prime — and we then add the labeled nucleotides to the mix. This incorporates the first nucleotide. The technology that Illumina really based its sequencing platforms on is the reversible terminator. Dideoxy sequencing — Sanger sequencing — of course uses a terminator that has lost the three prime hydroxyl group, so it cannot extend any further. The breakthrough for Illumina is that they use a reversible terminator: once it's incorporated, you wash everything else away, shine a laser on the flow cell surface, take an image, then reverse the terminator, add the next nucleotide, and so on, and so on. So that's sequencing by synthesis. We're doing it on the flow cell surface, on the set of clusters that have formed, and we're sequencing them one base at a time using sequencing-by-synthesis technology. Okay, so how does the actual sequence — yeah, go ahead, sorry. Say that again, sorry? Dark cycles — so now you're talking about some platforms, in particular the NextSeq, which is a relatively recent platform from Illumina, where they've changed the chemistry so that there are only two dye signals detected, and one base is determined by the lack of a signal — the so-called dark cycle. Maybe I'll go through this, and then I can explain what the dark cycle is. But essentially, how we convert this information into a sequence string is shown here. We start with our first image — and in fact it's the first 25 images — which generates something called the focal map, okay? The focal map is important: it's how the sequences get named, and it's actually how we relate read one to read two.
So the focal map — now we're going to look at the flow cell surface this way, from above. We break the flow cell into tiles: the flow cell has a bunch of lanes, the lanes themselves can be broken into tiles, and within these tiles we have a bunch of sequences being generated. Initially this placement was random; more recently, on the 4000, the NextSeq, and the HiSeq X platforms, these are ordered arrays, so we actually know what the positions are — but the concept is exactly the same. So we take this lane — let's say this is lane one, of course lane two is here, lane three; a typical flow cell has eight lanes, although there are various configurations now, it doesn't really matter — and we take this tile and break it into X and Y, so we have XY coordinates. The focal map essentially assigns, for every position it's getting a signal from, an XY position for that cluster, okay? And as I'll show you, that XY position becomes encoded in the sequence name, and it's how we keep the sequences straight throughout the rest of the bioinformatic process. If you don't have the XY coordinates, you have no idea how to relate one read to the other read — and the sequencing is done independently: we do read one first, then we do our index read, and then we do read two, and we can talk about the details of that. All of it is bound together by these XY coordinates. So the XY coordinates uniquely identify that sequence on the flow cell surface, and allow us to know that read one and read two are related to one another, or that read one and the index are related to each other. So the focal map is a critical component of generating the sequence. Now, when we sequence some types of libraries — not usually a problem for epigenomic libraries, but if you have a library that has low diversity, for example amplicon libraries — this can cause a problem in generating the focal map, because you have one base that essentially overwhelms the camera. If it's all A's at the first base of incorporation, it overwhelms the camera, and it cannot generate the focal map. That's why — maybe some of you are aware — we spike things in, like PhiX: we spike PhiX in to provide the diversity that allows it to assign positions to the X and Y coordinates. So once we've assigned the X and Y coordinates — shown in that first image — you can think of the run as a stack of images. We walk through this stack, essentially looking to see what fluorescent signal comes off each position at each cycle: cycle one, cycle two, cycle three, cycle four. So you can just see there's T, T, T, T — I can't see that one — G, T, right? Once the focal map is generated, we simply step through each of the images, looking at that X, Y position, recording the intensity that emerges from it, and converting that into a base call. Now, dark cycles are when you incorporate — and I don't remember which base it is that's the dark base — but when you incorporate that base, you essentially get no signal, and you interpret the absence of signal as that base. As far as I know, that has no negative consequences in terms of quality.
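[A small sketch of the two-channel decoding, assuming the standard Illumina two-channel scheme in which G is the dark base: A lights up both channels, C the red channel only, T the green channel only.]

```python
# Illumina two-channel (NextSeq-style) base calling: each cycle produces
# only a red and a green image, and one base is encoded by darkness.
TWO_CHANNEL = {
    (True,  True):  "A",  # signal in both channels
    (True,  False): "C",  # red only
    (False, True):  "T",  # green only
    (False, False): "G",  # no signal at all: the "dark cycle"
}

def call_base(red: bool, green: bool) -> str:
    return TWO_CHANNEL[(red, green)]

print(call_base(red=False, green=False))  # -> G
```

[One side effect worth knowing: a dead or finished cluster also produces no signal, which is why two-channel platforms are prone to spurious runs of G calls at the ends of reads.]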
We — I mean, I don't know if Guillaume uses the NextSeq for whole-genome bisulfite; there has been some suggestion that there might be some quality concerns with whole-genome bisulfite data, but we haven't seen that, although we don't use it that frequently. Typically we're doing whole genomes on the X platform now. Yeah? You said the first 25 — so as it's reading the base calls, it's also building the focal map? So you're refining it — you're refining the focal map over the first 25. The first generations of the Solexa machines used, I think, the first five images — maybe the very first used just the first image — but they found that the quality of the positions, as you can expect — I mean, these things have different sizes, some of them are closer to one another than others, you get bleed-through; you see this one's a big one, but you've got these little small ones over here. So there are a lot of little nuances in the generation of the focal map. And that threshold was set completely empirically? All these images take a lot of space — are you storing them? Ideally, you would take all of the images and do an even better job, but this is actually processed in real time because of the size of the images. So the threshold is set so that you have enough, and then it's processed in real time. The images are not kept, and are not transferred. Once the map is in place, the information is captured, and those images are deleted. Good point. Yes, question? How does it work here — some of these clusters look orange, a mixture of two colors. What does that tell you? Yeah, so I'm going to talk about that in a second. And why is it that we can only read 100 bases? That gets into something called phasing and pre-phasing, which is one of the failure modalities, and hopefully that will become apparent as we go through. But yes, obviously biology is not black and white — there are mis-incorporations, and the chemistry itself can be problematic, and I'll talk about that in a minute. I just wanted to very briefly remind you that we can also do paired-end reads, right? Many of you are familiar with paired-end reads. Certainly for ChIP-seq data, paired-end data has significant advantages over single-end data, although you might be using single-end data. With paired-end data we don't just predict — we know the size of the fragments, as I think Misha will talk about. That provides some additional power in the analysis downstream, but you can also do single-end. Is everybody familiar with what a paired-end read is? Do I need to go through this? No, everyone knows, right? We're reading in from either end of the fragment — but just to remind you, that actually happens by regenerating the cluster after we do our first round of sequencing, but going the other way this time. And that's what gives us the other end. So we do the cluster generation the first time, we clip one end, we block, and then we can extend — that's read one — and then we do it the other way. Yeah? How much does that improve the alignment? Oh, excellent. So — again, Misha will cover this — the actual increase in mappability is about 2% in a mammalian genome. So it's not a huge increase. I won't steal Misha's thunder, but it's about a 2% increase in mappability. What it does allow you to do, though, is know what the fragment size is.
So when you do single-end reads, you can computationally predict the size of the fragment you've sequenced, but when you read both ends, you know where the fragment starts and ends. And if we know we're dealing with a nucleosome, we can actually use that information in our analysis, as we'll go through. Yeah — well, the fragment sizes are more about how you transform that data into signal intensities, where that ChIP-seq peak is, rather than the mappability itself. Right, so now you're getting into a fairly specific case: mixing two strains of mice together to get variants that we can use to distinguish the paternal and maternal alleles. There you're going to see a bit of an advantage from paired-end sequencing, because you're going to rescue some of those variants that would otherwise throw the aligner off, depending on how you set your alignment thresholds, as we'll talk about — the seed length, right? And in other cases, like retroviral elements, paired-end reads can be very important. If you're interested in repetitive regions of the genome — and maybe some of these are the regions you're talking about — we can leverage the paired end. As we'll discuss, if you've got a retroviral element in the genome — let's say this element is found a hundred or a thousand times in the genome — when the aligner places one read, say read one, correctly within the retroviral element, it doesn't know which copy of the element it is. And the way aligners handle that is they just report one randomly. They'll say: okay, it's mapping quality zero, it's in a hundred places in the genome, I'm going to place it here, right? That's how it deals with it. But if you have paired-end reads, you can actually anchor it by its mate, which lands in the unique sequence flanking that retroviral element. And now, because that mate is uniquely mapped to this place in the genome, we know that the retroviral element is actually this one — it couldn't possibly be one on a different chromosome or kilobases away. So that's one way of using the paired-end data. In the castaneus example, I imagine there must be somewhat more than a 2% gain in mappability? I'm giving you the number for a human genome; you're going to see a little bit better mappability, but I don't think it's going to be dramatic. Partly, in this example, you can also just relax the number of mismatches: you can allow more mismatches and raise your alignability with the paired end. Because the more mismatches you allow, the more locations a read can map to — but with paired ends you can still rescue it. Yeah, so this gets into the question of sensitivity and specificity. As we'll get into, you can adjust how you tell the aligner — like BWA — to do the alignment: how many mismatches it will allow and still place that seed. And as Misha says, you can leverage paired-end reads slightly differently. That's not a common application, and I would caution people against doing it unless you really know what you're doing — I would just use the default, you know, two mismatches. When you use a mouse strain that is not the reference, of course, you would expect that —
Sure, yeah, yeah. It would allow a little bit more freedom for the reads, because you know that there is variation in the sequence. Yeah. So there's something to be gained there? There is, but it's not double, right? No. So I would argue — and I think Misha will talk about some of the other advantages — that just being able to determine where that nucleosome is in the genome has value in itself. Maybe if you're just windowing the genome into 1 kb regions, you don't care exactly where that nucleosome is. But certainly you're going to get higher resolution in the nucleosome mapping, because you know exactly what the fragment was that you IP'd. So I can say, personally, at least in my own lab, I always do paired-end reads. By the time you've invested in the molecular biology, the sample, and all the rest of it, it's an incremental increase in cost, and I think there's value in it — and I think Misha will go through that beyond just the mappability question, which is an important one. Yeah, I think someone had a question, yes. No, they're different things. So mate pairs — I don't know if people still do mate pairs, do they? Mate pairs were a way of making long insert sizes. They actually involved the generation of a circular fragment. (This is not related to epigenomics — I just read it somewhere.) Yeah, so mate pairs are not the same as paired ends, and if you align a mate-pair library using standard paired-end alignment defaults, it won't work, because the reads come from a ligated fragment, where you're ligating together two pieces of the genome that are far apart. This was used essentially for chromosome-scale work, genome assemblies and things like that. It's not used so much anymore, because now we have long-read technologies — PacBio, for example; Oxford Nanopore is another example — and there are other strategies now, like 10x Genomics, where we're adding indexes to individual molecules and are able to assemble contigs together. So not used as much, but it was actually quite popular five or ten years ago. Anyway: mate pairs are not the same as paired-end. Good question. Okay, so now we're going to get into the actual file types. Oh my gosh, okay — yeah, I'll be quick. So we start with the images, and as Guillaume just mentioned, the images are now deleted in real time. When NGS first started, we kept the images — some of us might remember those days. We used to have a little tiny compute cluster, a three-node cluster, sitting underneath the sequencer. And the lore is — this is what I heard — that it was a co-op student at Illumina who actually came up with the algorithm that does so-called real-time analysis, where the images are converted in real time to generate the sequences and then dumped. There was a time when we were actually asked to submit the images to the repositories — to EGA, or SRA — but clearly that's not scalable, and now we don't keep them, right? But again, that's something we can discuss over coffee. Okay, so there are two things I want to talk about here. One is chastity filtering.
Okay, so now we're going to get into the actual file types. Oh my gosh, okay. Yeah, I'll be quick. So we start with the images, and as Guillaume just mentioned, the images are now deleted in real time. When NGS first started, we kept the images. Some of us might remember those days: we used to have a little tiny compute cluster, a three-node cluster, sitting underneath the sequencer. The lore is, and this is just what I heard, that it was a co-op student at Illumina who came up with the algorithm that does so-called real-time analysis, where the images are converted into base calls in real time and then discarded. There was a time when we were actually asked to submit the images to the repositories, to the SRA, but clearly that's not scalable, and now we don't keep them. But again, that's something we can discuss over coffee.

Okay, so there are two things I want to talk about. The first is chastity filtering. This comes out of the image analysis, and it's one of the quality metrics the Illumina sequencer generates. So what is chastity filtering? It relates to the question that was asked earlier about that orange cluster, the cluster that's a mixture of two different fluorescent nucleotides. The idea behind chastity filtering is, when the instrument first builds its map of cluster positions, how pure is the signature being generated from each cluster? Is it primarily one fluorescent signal, or a mixture of two fluorescent signals, which might indicate a polyclonal cluster? You can imagine two DNA strands that annealed right next to one another, so that when the cluster is generated, it's actually a mixture of two strands. That's problematic for a lot of reasons, as I hope you can appreciate. So this concept of chastity filtering was developed, and it's a very simple calculation: the brightest signal divided by the brightest signal plus the second-brightest signal has to be greater than or equal to 0.6. The most recent version of the algorithm calculates this over the first 25 bases and allows one failure. Chastity filtering is one of the flags encoded in the sequence name in the FASTQ file, and it's one of the important quality metrics generated by the sequencing itself.

The other component, which I'm not going to spend much time on, is the concept of phasing and prephasing. You can imagine that if you've got a thousand molecules in a cluster incorporating bases every cycle, at every cycle you're going to get some that fail to incorporate a base, so they fall behind and go out of phase (phasing), and some that incorporate more than one base in a cycle and run ahead (prephasing). As you extend out a hundred bases, more and more molecules fall out of phase, so when you shine the laser and try to determine the signal at a given position, it becomes increasingly mixed, and eventually it's impossible to determine the primary base. So that's phasing and prephasing, and that's another characteristic of the platform. Of course, single-molecule sequencing technologies, take Oxford Nanopore as an example, have a bigger problem, because they're not making a consensus call over a cluster; they get one determination of what that sequence is, and that's it. Whereas on Illumina and other clonal platforms, you're basically taking a consensus call across the cluster, and that's what gives the higher base qualities we see on the Illumina platform versus single-molecule technologies like PacBio and Oxford Nanopore. They don't have that consensus call, and that's a real fundamental difference between the two.
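To make the chastity calculation concrete, here is a toy sketch of the filter as just described: per cycle, the brightest channel intensity over the sum of the two brightest must be at least 0.6, evaluated over the first 25 cycles with one failure allowed. The function and the intensity values are illustrative, not Illumina's actual implementation.

```python
# Sketch of the chastity test: brightest / (brightest + second-brightest)
# per cycle, checked over the first 25 cycles, allowing one failure.
def passes_chastity(intensities_per_cycle, threshold=0.6, cycles=25, max_failures=1):
    """intensities_per_cycle: list of 4-tuples, one intensity per channel (A,C,G,T)."""
    failures = 0
    for channel_intensities in intensities_per_cycle[:cycles]:
        top_two = sorted(channel_intensities, reverse=True)[:2]
        chastity = top_two[0] / (top_two[0] + top_two[1])
        if chastity < threshold:
            failures += 1
    return failures <= max_failures

# A pure cluster: one channel dominates each cycle, so it passes.
print(passes_chastity([(900, 30, 20, 10)] * 25))   # True
# A polyclonal cluster: two channels with similar signal, so it fails.
print(passes_chastity([(500, 450, 20, 10)] * 25))  # False
```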
Okay, the other thing that's important to remember is that we encode base qualities, and we encode them using a log transformation of a probability: Q = -10 log10(P), where P is the probability that the base call is wrong. This is the Q score, the base quality score. The original base quality, or Phred score, takes its name from "Phil's read editor", after Phil Green, and it was developed as part of the human genome sequencing project. When we started sequencing genomes in anger to generate the human genome, we needed a way of encoding quality into the bases. How do we know that a given base call is of any quality? So the concept of the Phred score was developed; again, it's a log transformation of a probability, and the probabilities are empirically determined from the data. On the early NGS platforms, the qualities were calibrated largely from mismatches against the reference genome, but as we did more sequencing, proper probability models were developed, and that's what gets encoded in the file.

Okay, in the last few minutes here, this is an intermediate file that I just wanted to show you, to give you a feel for how we get from the Illumina output to the FASTQ file. Let me break it down for you. The major component is, of course, the sequence string itself, which should be fairly self-evident, and then the base qualities, which are here. The base qualities don't look like 20, 30, 40, 50, as you would expect from a Phred score, and that's because they've been encoded in ASCII. ASCII encoding is just a way of representing a two-digit number as a single character to save space; we're dealing with files that are millions of lines long, so it's a very simple, lossless way of compressing them. Now, the information up here is what I was going on about earlier, which is how the read is named. We have the instrument, the run number, the lane (this is the third lane), the tile (1101), and then the X and Y coordinates. This is the information that's concatenated together to uniquely identify the read, as you'll see as we go through the lectures. Then there's the ASCII encoding itself. We use an offset of 33, which is the base encoding of essentially all FASTQ files now. There was a time when, for whatever reason, Illumina was using an offset of 64, and you need to know the offset to convert an ASCII character back into a quality score. So if you looked here, you'd see that 'h' encodes 104, because this is one of the original Illumina files: Illumina used to encode with an offset of 64, so 104 minus 64 gives a quality score of 40. If you go back into some of the older data in the public repositories, you'll see it's encoded with ASCII offset 64. All data from roughly the last five years is offset 33, which is simply a different base: ASCII 33 is the first printable, non-whitespace character, which is why it was chosen for the encoding.

Okay, so we put this all together to generate the FASTQ file, and this is what a FASTQ file looks like. It has four lines. The first line starts with an @ followed by a string that incorporates that naming information; we'll go through it in more detail in the next module. So we have the name of the sequence, then the sequence string, then a plus sign (sometimes the name string is repeated after the plus sign, but typically the rest of that line is blank), and then the ASCII-encoded quality scores, and then we start with the next sequence, and so on and so on. So FASTQ files are four-line records that encode the sequence name, the sequence, and the sequence quality.
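Putting those pieces together, here is a minimal sketch of walking a FASTQ file four lines at a time and converting the ASCII quality string back to Phred scores; pass offset=64 for the older Illumina encoding. The file name is hypothetical.

```python
# Sketch: parse four-line FASTQ records and decode ASCII-encoded qualities.
def parse_fastq(path, offset=33):
    with open(path) as fh:
        while True:
            name = fh.readline().rstrip()
            if not name:
                break                      # end of file
            seq = fh.readline().rstrip()   # line 2: the sequence string
            fh.readline()                  # line 3: the '+' separator line
            qual = fh.readline().rstrip()  # line 4: ASCII-encoded qualities
            phred = [ord(c) - offset for c in qual]
            yield name[1:], seq, phred     # strip the leading '@'

for name, seq, phred in parse_fastq("reads.fastq"):  # hypothetical file
    print(name, seq[:10], phred[:10])
```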
Yeah. I don't know. You mean the plus line? I don't know why it was put there. It's a good question: why do we need this line, right? Because it's obviously taking space. So the FASTQ format was developed at the Sanger Centre as a way of encoding NGS data. When next-generation sequence data was first generated, it was all just FASTA files with no associated qualities. FASTQ was a way of collapsing the sequence and the quality together into a standardized format, and it was then adopted worldwide and became the standard. Again, the qualities are encoded with an ASCII offset of 33, so remember that: offset 33 for the qualities. And I have no idea why there's a plus; I should maybe look into that. Sometimes you see the sequence name there, so it's not just a plus. Yeah, sometimes you get a repeat; what I've seen is a repeat of the read name, but why would you want to repeat the name? Maybe it's to keep an even number of lines, right? Four lines rather than three. Might be, could be. Good question.

Okay, so I think in the next module we'll go through this, so maybe it's time for a coffee break; I'm a little bit over time. We'll talk about sequencing quality, library quality, and IP quality as we go into module one. Yeah, sorry, usually I finish on time, but I'm just going to say a few more words, and then David is going to go through module one, which will include the first computational component, which is running FastQC. It's very easy, and it's a good introduction to the computational side of things. I just wanted to briefly set it up: you've completed your ChIP-seq experiment, or you've obtained the data from a collaborator. How do you assess the quality of the resulting FASTQ file? There are really three main aspects. The first is sequencing quality, and FastQC is the package we use in the lab to look at the quality of the actual sequencing. Of course, you've got a file that's 10 million or 50 million lines long; you need some way of summarizing the quality. FastQC is a tool that has been used for many years to do this. It comes out of the Babraham Institute, and it provides a nice summary. The second aspect is library quality: how do you know that the library itself is of good quality? I was talking a little with Sarah at the break about this. Understanding the biology end of things and how the experiment was done, whether it was MNase digested, whether it was cross-linked, what kind of antibody was used: all of this is critical to your interpretation. You can have an experiment done two different ways and reach different conclusions purely based on technical differences, so you really need to pay attention to this. And the third is the IP quality itself. So FastQC is the tool we're going to run. You can run it through a user interface, but we're going to run it on the command line in module one; a minimal example of that is sketched below. It generates an HTML file, which you can then go through, looking at the outputs in the various categories. This is something you should run for all of the FASTQ files you generate. There are wrapper tools that can run FastQC across a whole number of data sets, but again, it's important to understand the data type you're running and how to read the FastQC output.
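As a taste of what module one will do, here is a small sketch, my own convenience wrapper rather than anything FastQC ships with, that runs the FastQC command line over a directory of FASTQ files and collects the HTML reports in one place. It assumes fastqc is on your PATH, and the input path is hypothetical.

```python
# Sketch: run FastQC over every FASTQ file in a directory.
import glob
import subprocess
from pathlib import Path

outdir = Path("fastqc_reports")
outdir.mkdir(exist_ok=True)

for fq in glob.glob("data/*.fastq.gz"):  # hypothetical input location
    # `fastqc <file> -o <dir>` writes an HTML report plus a zip of metrics.
    subprocess.run(["fastqc", fq, "-o", str(outdir)], check=True)
```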
So you'll see examples where FastQC has thrown a warning, saying wait a second, there might be something wrong with this sequence data, but it's only because it's an IP library: the data has certain expected characteristics, and you need to keep that in mind. Okay, so I spent a little bit of time talking about this already, but it's something we spend a lot of time looking at, which is DNA fragment length. This is essentially a distribution of the paired-end reads once we've aligned them to the genome: we can simply ask, what's the distribution of insert sizes that we get? Now, of course, we're leveraging paired-end reads here. We've got our sequenced fragment and we read from either end, so we have both ends; we can align those back to the genome, as we'll talk about, and then generate an insert-size distribution (a sketch of that calculation follows below). I just wanted to make a point here. You can see the input, which is this black line; there's a whole series of experiments run here, and the input shows the expected distribution, centered around a single nucleosome, about 150 base pairs. That tells us our MNase digestion was near completion, in the sense that what we got back was essentially mononucleosomal. For those of you who are interested, we also sometimes see these shoulders at smaller fragment sizes, and you'll probably see this in your data as well, especially MNase data. This is something I've been trying to get a student to look at for years; it's quite interesting that we see these smaller shoulders just from MNase digestion, which might reflect some other nucleosome structure, but I'll leave it at that. So that was the input, and then we did IPs with either anti-H3K4 trimethylation or anti-H3K27 trimethylation antibodies. The point I want to make is that we see a shift in the distribution relative to the input, suggesting that there is in fact a dinucleosome form being enriched by the H3K27 trimethylation IP. This is a typical trace; again, there are many experiments run here, and you'll see this in an MNase digestion. So you're actually enriching for a larger fragment size because of the IP. And Sarah was just mentioning that in her lab, when she does H3K4 trimethylation and sees a shift up, it's an indication that perhaps the data quality isn't as good. So looking at the insert sizes is a critical component of quality assessment. This is obviously after sequencing, but there are technologies, such as the Agilent Bioanalyzer, that some of you might be aware of, which you can use before you sequence the library to look at the insert size and see whether you're seeing a shift or not.
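The insert-size distribution itself is easy to pull out of a paired-end BAM: the template-length (TLEN) field of each properly paired read gives the fragment length. Here is a minimal sketch with a hypothetical file name, where a well-digested MNase input should peak around 150 bp and a dinucleosome-enriched IP nearer 300 bp.

```python
# Sketch: tabulate paired-end insert sizes from a BAM via the TLEN field.
from collections import Counter
import pysam

sizes = Counter()
with pysam.AlignmentFile("ip.bam", "rb") as bam:  # hypothetical BAM
    for read in bam:
        # Positive template_length counts each pair exactly once
        # (only the leftmost read of a pair has a positive TLEN).
        if read.is_proper_pair and read.template_length > 0:
            sizes[read.template_length] += 1

for length in sorted(sizes):
    if length <= 600:
        print(length, sizes[length])
```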
And then there's a whole series of metrics we can use to ask whether the IP itself is of good quality, and I wouldn't say this is a solved problem. One of the things we do at our centre is compare each library to a compendium of existing data, to get an understanding of where our library sits on the distribution. Because, of course, we don't have a gold standard in ChIP-seq; we don't know what the answer is. So the question is, how good is the quality of the library we're sequencing, and how confident can we be in that data? Again, Misha's going to talk about this in more depth. Obviously, the amount of sequencing you generate is an important component: you want to be within the range I talked about. If you sequence too lightly, say five or ten million reads, you're going to have a much harder time distinguishing signal from noise; again, Misha will talk about this. So sequencing to the appropriate depth is important. The number of reads that align uniquely to the genome is another measure of quality. Oftentimes, if you have a poor-quality ChIP-seq library, you'll see many reads aligning to repetitive regions of the genome; this is characteristic of other types of IPs as well, such as MeDIP libraries. So looking at what fraction of your reads align within the uniquely alignable regions of the genome is important. Another library quality measure you can derive from the primary alignments is PCR duplicates. Here is a whole distribution of libraries we've generated in the lab, from low PCR duplicate rates to high. If you're getting in the range of 10 or 20 or 30% PCR duplicates, you've likely over-amplified your library, and the quality of the resulting data is going to be impacted. And finally, another approach is to test against expectations. I've told you some characteristics of where we expect the reads, so you can leverage that information. For example, for H3K27 acetylation, we know that K27 acetylation is associated with enhancer states. So I can simply ask how many of my aligned reads fall within known enhancer regions, and I can pull those annotations down from resources such as the UCSC genome browser, the ENCODE consortium, or the IHEC data portal, for either my cell type or a related cell type. If only a few percent of the reads land in enhancer states, you might have some concerns about the quality of the data. Likewise for gene bodies with H3K36: you expect the majority of your H3K36 reads to align within gene bodies, and if you find no enrichment in gene bodies, you might have concerns about the quality of the data. A rough sketch of two of these alignment-level metrics follows below. Okay, so again, we're going to spend quite a bit of time talking about metrics; I just wanted to introduce them briefly. And now David is going to start module one, where we introduce the Compute Canada node and then run through FastQC.
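To close the loop on those alignment-level metrics before the break, here is a rough sketch, with hypothetical file names, of computing a PCR duplicate rate (assuming duplicates have already been flagged, for example with samtools markdup) and the fraction of reads falling within a set of annotated regions, such as enhancers from a BED file. It needs an indexed BAM, and the region count is approximate, since a read spanning two intervals is counted twice.

```python
# Sketch: two alignment-derived ChIP-seq QC metrics.
import pysam

def duplicate_rate(bam_path):
    """Fraction of primary mapped reads flagged as PCR duplicates."""
    total = dups = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            total += 1
            if read.is_duplicate:
                dups += 1
    return dups / total

def fraction_in_regions(bam_path, bed_path):
    """Approximate fraction of mapped reads overlapping BED intervals."""
    regions = []
    with open(bed_path) as bed:
        for line in bed:
            chrom, start, end = line.split()[:3]
            regions.append((chrom, int(start), int(end)))
    with pysam.AlignmentFile(bam_path, "rb") as bam:  # requires a .bai index
        total = bam.mapped
        in_regions = sum(bam.count(c, s, e) for c, s, e in regions)
    return in_regions / total

print("duplicate rate:", duplicate_rate("ip.bam"))                      # hypothetical
print("fraction in enhancers:", fraction_in_regions("ip.bam", "enhancers.bed"))
```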