All right, well, it's great to get to know everybody, and I just wanted to say that not everybody at UBC is named Martin. We do have some other names as well, but that's my joke for this morning. All right, so I think we've already been through the license. I'll be teaching the next couple of modules in partnership with Edmund, and really the purpose of this first module is to frame the discussion we're going to be having around ChIP-seq analysis. This year we've spiked in a little bit of ATAC-seq, because in previous years many students have been interested in learning about ATAC-seq, and as you'll see the ATAC-seq workflow is nearly identical to the ChIP-seq workflow, with the exception of some details we'll get into around how you process the data. So with that, I'll get started.

I'm going to be covering a lot of conceptual ground in the next few lectures, and I'm also going to be using some fairly deep technical jargon, so it's best to ask questions as we go along. I won't be able to monitor the Slack channel, but I will be able to see hands, and of course there are many other fabulous folks on the line who can point out if there are any questions. So I really encourage you to engage with the material as it comes, raise your hand if you have any questions, be present in the virtual classroom, and have fun. I know this is cliché, but there are no dumb questions. If you have a question, I'm sure someone else in this room has the same question, so feel free to shout it out; that allows us to cover material that I don't have time for or don't have in the slides. The slides are really meant as a resource for you to take notes from. Do take notes; the slides are available. I uploaded them to the Google Doc, so you should have a copy of the slides as well.

All right, so what are we going to be talking about today? We're going to be talking about ChIP-seq, and a little bit of ATAC-seq as I said. What has really enabled us to measure functional regulatory features in the genome has been the development of next generation sequencing platforms, in particular second generation sequencing platforms. These are platforms that are able to sequence small fragments of DNA in a highly multiplexed way, and it was really the development of these technologies in 2005 and 2006 that launched the field of epigenomics. For myself, I got interested in it while I was working on one of the first Solexa instruments, which later became Illumina, working with Tony Kouzarides at Cambridge. He had done some ChIP, I had the sequencer, and we did some of the first ChIP-seq libraries on histone modifications, and I got hooked as soon as I started seeing the data aligned on the genome browser.
So the two technologies we're going to be talking about today are chromatin immunoprecipitation sequencing (ChIP-seq) and a flavour of open chromatin analysis (ATAC-seq). Both use some methodology to shear up the genome, and we need to shear up the genome, as we'll discuss, because we need to generate the short fragments that a second generation sequencing platform can handle. It just so happens that those fragments are about the same size as the wrap of DNA around a nucleosome, so it makes a pretty nice marriage between a molecular application and a technical platform.

The learning objectives of this first part of the module are to understand the approaches and utility of epigenetic measurements in genome-based research, and I'll briefly touch, at a high level, on why we're interested in studying this area at all. Then we'll talk about the principles and challenges of ChIP-seq and ATAC-seq analysis. By the end of this lecture you should be familiar, at a high level, with the ChIP-seq and ATAC-seq workflows and selected quality measurements that we use, and then in the next module we'll get into more of the technical details of the workflow itself. So hang on, I've got to close the screen. There we go. All right.

So why are we interested in epigenetics at all? I think most of us on this call have some understanding of what epigenetics is, but for me there are many diseases, human traits, and indeed memory itself that may in part be encoded in the epigenetic state. These are examples I use in class, but I think they span the breadth of those concepts. So first, thinking about in utero exposure in times of famine: there have been a number of studies revealing what appears to be evidence of some type of transgenerational inheritance. One is the Dutch Hunger Winter during the Second World War, when women who were pregnant were subjected to severe caloric restriction, something on the order of 300 calories a day. The babies born to women who were pregnant during that period did, of course, have low birth weight. But once they were born the famine was over, and the children grew up in a normal environment. However, as those children aged, they showed a higher prevalence of many common diseases such as cardiovascular disorders, diabetes, psychiatric disorders, and so on. Perhaps even more surprising is that the grandchildren of the women who were pregnant during that period are also showing this higher prevalence, suggesting that there may be some sort of epigenetic mechanism, some sort of transgenerational inheritance, carried forward from the babies born during that time.
And intuitively this might make sense if we think about epigenetic mechanisms as a way of sensing the environment and preparing the organism for the environment into which it will be born and raised: preparing for famine and programming the metabolic state to be ready for famine, and then being born into a time when there is no famine, leaves a metabolic program that is not aligned with the actual environment.

Then we have traits and diseases. In the middle panel here, I'll put on the laser pointer, we have examples of traits that are very highly correlated with genetics. We're comparing identical twins, twins that have an identical genome, with fraternal twins, who share the same environment but not an identical genome. For some traits the correlation with genotype is very high; in fact, almost all of your height, for example, can be explained by your genotype alone. However, for most common diseases, including psychiatric diseases, neurodegenerative diseases, and many cancers, the link to genetics is actually quite low, suggesting there is some other feature at play, and that is potentially epigenetic mechanisms. There are many examples of identical twins where one twin will have leukemia and the other will not. Leukemia, a cancer of the blood, often if not always involves some level of epigenetic dysfunction, for example in DNA methylation.

Memory, I think, is another very interesting and emerging area. In this particular case I'm showing you a fly model where short term and long term memory can be measured; it turns out that flies remember whether or not they have mated. If you knock out an enzyme that mediates an epigenetic mechanism, in this case H3K9 di- or trimethylation, this can inhibit both short term memory, the red bar here, and long term memory, suggesting that epigenetic mechanisms may play a role in the formation of memory.

So what do we know about epigenetics? We know that you have roughly 300 cell types in your body, and as far as we know each of those cell types has unique packaging of its genome, and this packaging provides a regulatory roadmap for cell type specification. Myself and many members of the teaching team have been involved in generating reference maps as part of the International Human Epigenome Consortium (IHEC). We've been involved in this work for over a decade now, initially with the Roadmap Epigenomics consortium, so many of you might be aware of this paper. This was our first publication of reference epigenome maps, where we took a set of ChIP-seq data sets and used a hidden Markov model, work by Jason Ernst, Manolis Kellis, and many others, to predict chromatin states across the genome. We then painted the genome with these chromatin states as a way of layering the regulatory information over the genome, so each row here represents an individual cell type, and the cell types are labeled in the left-hand column.
So this was an initial pass at about 111 reference human cell types, and subsequently we have continued to expand that, as I'll talk about in a minute. So what are these histone modifications? Hopefully most people here are aware of them, but just to remind us, they come in two main flavours. There are activating modifications: for example H3K4 trimethylation and H3K4 monomethylation, which mark active promoters and enhancers respectively, and H3K27 acetylation, which marks active enhancers. And they also come in repressive flavours: for example Polycomb-associated H3K27 trimethylation, which is one of the modifications most often disrupted in common diseases, including many cancers.

The nomenclature we use in the ChIP-seq field, or the epigenetics field generally, encodes quite a bit of information in the name. The first two characters refer to the histone, in this case histone H3; then the amino acid that is modified, in this case lysine 4, the fourth amino acid from the N-terminal tail of histone H3; and then the modification itself, in this case trimethylation. So H3K4 trimethylation, H3K4 monomethylation, and so on. You can see that the same lysine, say H3K27, can be acetylated, and that has a particular meaning, association with active enhancers, but the same lysine can also be methylated, in this case trimethylated, and that is associated with a repressive chromatin state.

You may ask why this particular set of histone modifications was selected. They were initially selected as part of the Roadmap program as the set of histone modifications that best describes the epigenomic state. There are many, many more; in fact there are about 100 different histone modification events that have been measured using orthogonal technologies such as mass spectrometry. As part of the Roadmap program, a number of groups had done a full panel of 30 or 40 histone modifications, and we reduced those down to the set that was most informative, and those are probably the marks many of you are aware of. There are of course many others, and some of those other marks have subsequently turned out to be quite informative, but that is why this set of five or six core histone marks was selected.

As I just mentioned, these marks have now been applied across a larger cohort, and this is ongoing work to generate what we're calling the EpiATLAS, which is a follow-on from the Roadmap and Blueprint initiatives and is meant to provide a reference framework for comparative analysis. Stay tuned; these data should be made publicly available in the spring of 2024. This slide gives you an idea of the bioinformatics framework that was built to provide a unified analysis of these data: we collect data from a large number of consortia around the world, run it through a standardized data staging and processing workflow, and then generate a set of tracks and chromatin states that are made available to the research community. Now, why do we have to go through all this trouble, you may ask?
Well, because we're dealing with human subject data. Human subject data is protected, so the sequence-level data cannot be made available in public repositories. We have to come up with a mechanism to process the data in such a way that the genotypes remain protected while the outputs can still be shared. What we have done is essentially to generate a series of containers, packaged workflows, that allow you to reprocess your own data using the same analytical pipelines and then compare it to the compendium of data that has been generated using identical workflows.

The data we'll be discussing and working with in the next three modules come from the IHEC consortium, and this includes two types of data sets. The first is MCF-10A, a mammary epithelial cell line; you can see the details here, it was isolated back in 1984. MCF-10A is open access, an open genome, so this is the data you will be analyzing, because we can share the FASTQ files, the raw sequence files, from which the genotypes could obviously be called. We will be comparing this, or Edmund will be leading a module comparing this, to data sets that have been generated from protected genomes. That data was generated as part of the IHEC consortium from reduction mammoplasty material, where we've sorted for mammary epithelial cell types, and I think we'll be taking one or two of those cell types and comparing them to MCF-10A. The protected-genome data sets will be shared with you in the form of BED files: a reduced representation, a transformation of the data, that allows for comparative analysis without exposing genotypes. If you're interested in learning more about this work and the Canadian consortium driving it, please go to this website.

Okay, so there are many different approaches to ChIP-seq. There are ChIP-seq protocols that use cross-linking, protocols that use cross-linking followed by sonication or some other method of shearing, flavours of ChIP-seq that use MNase digestion, ChIP-seq that uses a transposase, and so on; indexing prior to IP, indexing after IP, et cetera. The data sets that emerge from these workflows, the FASTQ files as we'll talk about, can all be processed using the workflows we'll cover over the next few days. I'm going to be talking in depth about traditional ChIP-seq, so I won't be covering indexed strategies, MINT-ChIP, et cetera, but remember that those data sets can also be processed using the same workflows.

So what is ChIP-seq? ChIP-seq is a sequencing-based approach used to quantitatively measure histone modification patterns in the genome. The panel on the right-hand side shows a typical representation of how we interact with these data, at least from a visualization perspective, and you'll be introduced to how we generate these tracks over the next few modules.
Just to remind us, this is a genome browser, the UCSC Genome Browser developed by Jim Kent for the Human Genome Project, where we can lay down features, or what we call tracks. Each one of these rows represents a different track, and each track is a ChIP-seq experiment from an individual sample. The height of the histogram indicates signal density, and that signal density is a transformation of the alignments generated from the IP material. We've coloured the tracks by which histone modification was targeted in the ChIP. So, for example, H3K4 trimethylation is at the top here, and we can see these nice peaks marking transcriptional start sites. Then H3K4 monomethylation, H3K27 acetylation in blue, and then some of the more broadly dispersed marks like H3K27 trimethylation in brown. DNA methylation is shown at the bottom, which others will get into tomorrow, I think. From these data you can see the patterns of the ChIP-seq signal and the relationships between the marks: some marks are correlated, some are anti-correlated. If we look at DNA methylation, where the height of the histogram indicates the level of CpG methylation, you can start to see relationships between histone modifications, in this case H3K27 trimethylation, and DNA methylation; there's a pattern here that seems to mirror the occupancy of H3K27 trimethylation, showing the well-known relationship between these two marks.

Okay, so what does the ChIP-seq analysis workflow look like? This is why we're here today. At a very high level it looks like this: we start with our IP, the chromatin immunoprecipitation material, and input FASTQ files; I'll talk about what a FASTQ file is in a few slides. The FASTQ file is the input, the file you get from your sequence provider when you do your ChIP-seq experiment, and it will be labelled by whatever your ChIP target was, plus of course your input. We use a short read aligner, and we'll talk about what those are in a few slides; for this module we'll be using BWA, which is one flavour of short read aligner. We use the aligner to align our reads to the genome, and that generates a SAM file, which we convert into a binary format called a BAM file. We then use some tools to sort and duplicate-mark those files, and pass them into our peak caller, MACS2. The output of MACS2 is a bedGraph file and BED files, and those are the file types represented here. So that's really the workflow for ChIP-seq.
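To make that workflow concrete, here is a minimal sketch of a typical command sequence, wrapped in Python purely for illustration. It assumes a paired-end library and a BWA-indexed hg38 reference, the file names are placeholders, the duplicate-marking step is omitted for brevity, and the exact flags used in the course module may differ.

```python
import subprocess

def run(cmd):
    """Run one pipeline step and fail loudly if it returns a non-zero exit code."""
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

ref = "hg38.fa"   # assumed: a BWA-indexed reference genome

# 1. Align the IP library (paired-end) and produce a coordinate-sorted, indexed BAM.
run(["bwa", "mem", ref, "H3K4me3_R1.fastq.gz", "H3K4me3_R2.fastq.gz", "-o", "ip.sam"])
run(["samtools", "sort", "-o", "ip.sorted.bam", "ip.sam"])
run(["samtools", "index", "ip.sorted.bam"])

# 2. Do the same for the matched input (background) library.
run(["bwa", "mem", ref, "input_R1.fastq.gz", "input_R2.fastq.gz", "-o", "input.sam"])
run(["samtools", "sort", "-o", "input.sorted.bam", "input.sam"])

# 3. Call enriched regions with MACS2, treating reads as paired-end fragments.
run(["macs2", "callpeak",
     "-t", "ip.sorted.bam",       # treatment = the IP
     "-c", "input.sorted.bam",    # control   = the input
     "-f", "BAMPE", "-g", "hs",
     "-n", "H3K4me3", "-B"])      # -B also writes bedGraph signal tracks alongside the BED peak files
```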
In addition, we're going to briefly touch on a few aspects of ATAC-seq. So what is ATAC-seq? ATAC-seq is a way of generating open chromatin data: data that represents the regions of the genome that are nucleosome-free. We know these nucleosome-free regions tend to be enriched in regulatory elements; these are regions of the genome that are bound by transcription factors or that occur at open promoters where nucleosomes are phased. There are many different ways of assessing open chromatin. The first methodologies were DNase I sequencing (DNase-seq), really productionized by John Stamatoyannopoulos's group near us down in Seattle. There's also MNase-seq, where we use micrococcal nuclease to digest the genome and then build libraries from the resulting fragments. ATAC-seq is different from DNase-seq and MNase-seq in that it uses a transposase to insert a DNA tag into the genome, which allows us to lift those fragments out and generate the specialized molecular libraries we need for sequencing.

The analysis is very similar. In fact, you can take an ATAC-seq FASTQ and run it through essentially the identical pipeline that you run ChIP-seq through, the major difference being that you don't have an input control. As we'll get into, when we're calculating the significance of regions of enrichment it is certainly recommended to use an input to measure the background signal genome-wide, and particularly at the regions in which we're calling peaks. For ATAC-seq we don't have that input, because what would it be? It would basically have to be shotgun sequencing of the whole genome. So instead we use the ATAC-seq library itself, and essentially distribute it randomly across the genome to estimate the background signal. That's one major difference.

The other difference we'll talk about is that you can add an optional step after generating the BAM file to take into account the Tn5 transposase read offset. This is what Edmund will go through in the module: accounting for the roughly nine base pair stagger that the Tn5 introduces when it inserts the tag, essentially the handle we use for the PCR that lifts the library out, into the genome. In the ENCODE ATAC-seq workflow, the reads are shifted to account for that insertion event. This is not something that you have to do, but it is something we're going to work through in our module.
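As a concrete illustration of that optional step, here is a minimal sketch of the read-shift convention used in ENCODE-style ATAC-seq pipelines, where plus-strand alignments are shifted +4 bp and minus-strand alignments -5 bp so that read ends sit on the actual Tn5 cut sites. The function below is my own illustration operating on simple coordinate tuples rather than on real BAM records.

```python
def shift_atac_read(chrom, start, end, strand):
    """Shift an ATAC-seq alignment so its end marks the Tn5 cut site.

    Convention used by ENCODE-style pipelines: +4 bp on the plus strand,
    -5 bp on the minus strand, compensating for the 9 bp stagger of the
    Tn5 insertion. Coordinates are assumed to be 0-based, half-open.
    """
    if strand == "+":
        return chrom, start + 4, end + 4, strand
    elif strand == "-":
        return chrom, start - 5, end - 5, strand
    raise ValueError(f"Unexpected strand: {strand}")

# Example: a plus-strand read starting at 1,000 moves to 1,004
print(shift_atac_read("chr1", 1000, 1050, "+"))
```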
I think I can see that there's a question; I can't see the questions themselves, so let me know if somebody has one. So what are some key considerations for ChIP-seq? First of all, antibody specificity and sensitivity. As you know, ChIP-seq involves the use of an antibody to enrich for DNA-protein interactions, for the protein that's bound to the DNA, so antibody specificity and sensitivity are key. I would also think about which marks you want to profile; that depends on your particular research question. Some favourites in the community include H3K27 acetylation, which marks active enhancers and tends to be one of the most dynamic features of the epigenome, and Polycomb-associated H3K27 trimethylation, a mark that is often disrupted in human diseases, including cancer. Then there's required sequencing depth: how deep do I need to sequence a library? Again, this very much depends on the experimental question and on the particular target you're working with, but the minimum recommendation that has emerged from the IHEC consortium, along with ENCODE and other groups, is 50 million read pairs for punctate marks like H3K4 trimethylation, which as I showed you forms sharp peaks at transcriptional start sites, and about double that, roughly 100 million read pairs, for broader marks like H3K9 trimethylation. And that's read pairs. Single-end sequencing used to be a thing, but most people now do paired-end sequencing, and I would certainly recommend a paired-end workflow if you're going to do ChIP-seq; as we'll talk about, it allows you to accurately measure the ends of the fragment and infer fragment occupancy with much higher specificity than you can with single-end data. But if you're working with single-end data, this would be 25 million fragments, or 50 million fragments for the broad marks.

Okay, there are many potential biases in ChIP-seq analysis, and understanding these biases is critical for performing ChIP-seq analysis well. It really is the old adage of garbage in, garbage out: you need to know what material is going into the sequencer, because that material is going to influence the output. Again, I can see that people are asking questions but I can't read them, so let me know if you want me to stop; otherwise I'll just keep going.

All right, so reagent specificity is central to a successful ChIP-seq experiment. If you haven't generated the data yourself, I would highly recommend you ask the data generator which antibody was used and what specific QC metrics were run for it; it is best practice in any publication to include both the catalog number and the lot number of the antibody. We spend more on qualifying antibodies than the antibodies themselves cost, and we typically qualify a catalog and lot number and then purchase the entire lot, if it's a polyclonal antibody, so that we have a consistent stock to use for subsequent IPs. Suffice it to say, take some time to qualify your antibody before you start your ChIP-seq experiment, and if you're receiving data from some other group, or you're reanalyzing data, really ask what type of qualification was done for that antibody.

We'll talk about some of the ways we can assess this downstream from the alignments, but there are also molecular techniques that can be applied upstream. One that we use is essentially a peptide array that shows specificity; you probably can't read this, but the top bar shows that the correct peptide target is the most highly enriched, although you can see there are also off-target effects for this particular polyclonal antibody. And then of course we do a Western blot, and this is probably one of the worst Western blots I've ever seen, but you can see that there is a specific band, only a single band. We look for at least 50% of the signal for a given antibody to be in the correct location in order to qualify the antibody moving forward. That would really be the first step, and following this we do a pilot ChIP-seq run and compare it to existing data sets to confirm the quality of the antibody. All right, so how do we actually do ChIP-seq? Getting into a little bit more detail here.
Of course we start with our genome. The genome is made up of nucleosomes that wrap DNA, and these nucleosomes have N-terminal tails that are modified; in this particular case, we'll just pretend this is H3K4 trimethylation. We shear the genome, and we can do this in lots of different ways: we can use transposases, as in MINT-ChIP, or we can cross-link with formaldehyde and then sonicate. The approach we have adopted in our group, and certainly as part of IHEC, is native ChIP-seq using an MNase digestion strategy; it is very well suited to histone modifications and is certainly the one I would recommend if you were doing a ChIP-seq experiment for histone marks.

So in the genome we have nucleosomes with DNA associated with them. We then use an antibody to recognize the modification, and enrich for fragments of DNA associated with proteins that carry the epitope our antibody recognizes. This becomes the material for library construction, the IP material. We also take a small aliquot of the genome that we've sheared up, by MNase or sonication or whatever method, and this becomes our input material. Input and IP: those are the two outputs from a ChIP-seq experiment.

Now we have to sequence it, so how do we do that? We add barcodes, essentially handles on the ends of the DNA fragments, that allow us to sequence them. This diagram is from an Illumina system, but the same idea applies to Complete Genomics or Ion Torrent systems, or many of the other systems that have come online over the last 12 to 24 months. Essentially your DNA insert is the material that came through the IP or through your input. We add adapters onto it that allow the sequencing primers to anneal, so this is the read 1 sequencing primer, this is the read 2 sequencing primer, and at the ends we add adapters that allow for cluster generation on next generation sequencing platforms; in the Illumina flavour these are the so-called P5 and P7 adapters. Importantly, between these adapters we have what we call the index, or barcode. All the ChIP-seq experiments you'll do will be indexed, and the index itself is just a short sequence, typically six nucleotides though it can be longer. The index allows us to pool many libraries together, sequence them all in one lane or one flow cell, and then split them apart downstream. Most of you probably won't be involved in using the index to do the splitting, but it's important to know that it's there, and some of you may indeed have to leverage it.

So we've generated our library; we have our barcodes and our adapters on the ends. What do we do next? There are really three methodologies that have been developed for clonal amplification: we now have to take those fragments and amplify them to allow us to sequence. The first is an oil-aqueous emulsion, like a salad dressing, where you're essentially making many, many small reactors in oil; this is how 454 and Ion Torrent do it, using a bead as a surface on which to amplify the DNA, and again it is quite effective. I don't have time to go through all the details.
The other way to do it is on a solid surface. This is the technology that really launched Solexa and then Illumina, and probably the technology you'll most likely be engaged with for ChIP sequencing, although that's changing. It uses a solid surface, a microfluidic slide, to do the cluster generation, as I'll talk about in a minute. And finally there is rolling circle amplification, which was initially developed by Complete Genomics and has been advanced by MGI/BGI for the MGI systems; instead of using a solid surface, your library fragments are circularized, and rolling circle amplification is used to generate clonal copies of the initial fragment.

So once we have our libraries, how do we physically make the clusters that allow us to sequence? On the Illumina system we have a slide that looks something like this, although it depends on which instrument you're using, with addressable lanes. Each lane can generate hundreds of millions of read fragments, and that's why we need to index our libraries before we load them. In this particular workflow we load each lane individually; you can see what the flow cell looks like here, and we essentially flow our library over the surface of the slide. The fragments are captured by DNA oligos grafted onto the glass of the flow cell surface, which capture either the P5 or the P7 sequence that we've added to our adapter, and that provides a template for sequencing. I'm not going to go through bridge amplification or how we actually generate the clusters, but I'm happy to discuss that in the break or on Slack if you have specific questions.

The sequencing itself is fairly straightforward these days: it's sequencing by synthesis. We start with our fragment, which in this case has the purple oligo, say the P5 oligo, hybridized to the flow cell surface. That provides a site where we can prime DNA synthesis, and then we add the nucleotides. In the Illumina sequencing chemistry the nucleotides are protected, they carry a reversible blocker, which allows us to add nucleotides one at a time; this is really one of the main innovations of the Illumina system. So we add one nucleotide at a time: here we start with a T that hybridizes to an A, we detect that signal using a laser that excites the dye attached to the T, then the block is removed, the next nucleotide is added and blocked, deprotected, and so on, and we build up the sequence one nucleotide at a time. We generate a series of images, and from the first image we generate what we call a focal map, which provides a structure recording where the clusters are on the flow cell surface in two-dimensional space; I'll talk about why that's important in a minute. From that focal map we can then build up the sequence, T, G, C, T, A, C, one base at a time, and that's why it's called sequencing by synthesis. This slide just shows an illustration of that.
So how does the barcoding itself work? In paired-end sequencing we sequence from either end of the insert, so read 1 starts here; this is your insert again, your IP material. Our read 1 sequencing primer anneals here and we generate that read. We then strip that strand off, anneal an index sequencing primer, and read the index. Then we strip the entire thing off, regenerate the cluster by bridge amplification, and read from the read 2 sequencing primer in the opposite orientation. So we're generating multiple individual reads, multiple individual workflows, which produce multiple FASTQ files, as I'll talk about.

How does the base calling itself work? We start with the intensity files, the images generated as we go through the sequencing-by-synthesis process. These produce what's called a BCL file, a binary base call file. This is the file that is then converted into the file type we'll be working with today, the FASTQ file. I mention the BCL file because some of you might be working, in the future or maybe even now, with single-cell data; for example, with 10x single-cell ATAC-seq data you would actually interact with the BCL file, and you'd have to use what is known as the BCL-to-FASTQ conversion, which generates your index-split FASTQ files from the BCL file itself. We won't be going into that today, but just remember that the BCL is the file before the FASTQ file, and if you're using single-cell technologies you'll need the BCL file as the input to generate your FASTQ files.

As we generate sequence data, we need some way of assessing the quality of the base calls coming off the platforms. This is obviously a tricky thing to do, since we have no way of knowing what the gold standard is. So something called the Phred score was developed as part of the Human Genome Project, actually on the forerunner of second generation sequencing, Sanger sequencing, so first generation sequencing. It is essentially a way of expressing the probability that a base has been called incorrectly. The formula is here: the base quality is the negative 10 log transformation of the probability that the base call is incorrect, Q = -10 log10(P). You don't need to understand the details beyond knowing that this is empirically determined. The way it has been determined for the Illumina system, and for other platforms, is essentially by resequencing a known, synthesized sequence over and over again, and then generating a series of observations that allow us to assign an error probability to each base call. The example here is from the Sanger platform, where it was based on things like how the peaks were spaced as they came off the instrument; it's a little harder to conceptualize for the Illumina platform, but it's the same principle, and we use the same quality metric, the Phred base quality score, for Illumina and other next generation sequencing.
So how do we go from these images to sequence data, and why do I keep harping on about where the clusters are on the flow cell surface? The reason is that if you're generating an experiment with millions or hundreds of millions of individual sequence fragments, you need a way of uniquely naming each of those fragments. That is done by looking at the position of the cluster in two-dimensional space. I talked about the focal map generated during the sequencing process; the focal map is composed of what we call tiles, where we break a little piece of the lane into a two-dimensional chunk with X and Y coordinates. This is obviously very dense, and each of the little clusters you see here is assigned an X and Y coordinate. That allows us to uniquely name the read using the flow cell ID, the lane it is on (in this particular case there are eight lanes, but flow cells come in many different flavours), the tile, and the X and Y coordinates of that particular cluster. That name is used as the unique identifier for the sequence and is carried through the rest of the analysis. The unique name is absolutely required for handling indexed or paired-end reads, because it is the one handle that allows us to relate the sequences to one another.

So how do we report the sequence reads that come off the Illumina systems, or any next generation sequencing system? In a FASTQ file. This is really an evolution of the FASTA file that many of you may be aware of; FASTA is how we store, for example, the reference genome sequences you'll be using. The FASTQ file is slightly different in that it contains your sequence but also the Phred quality score, that negative log10 transformation of the error probability, for every base in the read. FASTQ is the universal standard for encoding next generation sequencing data. A record starts with an @ character followed by the unique sequence name, as we'll talk about; the second line is the nucleotide sequence string; there's a third line that sometimes replicates the information in the first line but is most often otherwise blank; and the fourth line encodes the base qualities. The base qualities are Phred scores, but if you look at them you'll say that doesn't make any sense, it's just a bunch of glyphs. Those glyphs are ASCII encoding: we use ASCII characters in the FASTQ file as a way of compressing a two-digit number into a single character. The only thing you need to know to convert the glyphs into a Phred score is the offset, and the universal standard for FASTQ is base 33. Knowing that, you look up the ASCII value of the symbol and subtract the base: for example, '9' has ASCII value 57, and 57 minus 33 gives a Phred quality of 24; '@' has ASCII value 64, and 64 minus 33 gives a quality of 31.
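Here's a minimal sketch of that conversion, assuming the standard Phred+33 offset; the four-line record shown is made up purely for illustration.

```python
def phred_from_glyphs(quality_string, offset=33):
    """Convert a FASTQ quality string (Phred+33 by default) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Invert Q = -10 * log10(P): the probability that this base call is wrong."""
    return 10 ** (-q / 10)

# A made-up four-line FASTQ record
record = [
    "@FLOWCELL123:1:1101:1234:5678",  # unique name: flow cell, lane, tile, X, Y
    "ACGTACGTAC",                     # the called bases
    "+",                              # separator line (often otherwise blank)
    "II9@FFFFFA",                     # base qualities as ASCII glyphs
]

scores = phred_from_glyphs(record[3])
print(scores)                         # [40, 40, 24, 31, 37, 37, 37, 37, 37, 32]
print(error_probability(scores[2]))   # Q24 -> roughly a 0.4% chance the third base is wrong
```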
Luckily, you will not often have to do these conversions yourself, but I did want to introduce them, because you'll see these glyphs in your FASTQ file, and if you're ever wondering what they are, they're the base qualities for each base as it comes off the instrument.

Note that, as I said, during the sequencing process read 1 and read 2 are generated independently, and they result in independent FASTQ files. That's also true for your index. More recently we've even moved to dual-barcode systems, with indexes on either end, so some of you might be familiar with two barcodes; it's essentially the same process except we're reading an index at both the 5' and the 3' end of the fragment. The index, read 1, and read 2 are all associated with each other through the unique sequence name generated from the tile and the X and Y coordinates of the original cluster. That's how we're able to relate the reads and indexes to one another.

What does that look like? Here's an example of a sequence name that comes off an Illumina system, where we can see the flow cell number, the lane, the tile (1101), and the X and Y coordinates, plus some optional information, sometimes including which read it is. And this is what it looks like in a FASTQ file: again we can see this information, and these are two independent FASTQ files. You can look at these yourself in the AWS data repo; the sequence names will look slightly different from this, but the principle is exactly the same. The way we relate the files to each other is, again, through the sequence name.

It's critically important that before you start your alignment, before you pass these files to BWA, the reads are read-name sorted, so that the first read in the read 1 FASTQ file and the first read in the read 2 FASTQ file have the same name; they differ only in whether they are read 1 or read 2, and so on down the files, and that order has to be retained. Otherwise BWA will choke. As we'll see, we use different ways of sorting reads: this is read-name sorting, but the other common way of sorting reads once they're aligned is coordinate sorting, where we sort them by their coordinates on the genome or reference. Before they enter BWA, though, they must be read-name sorted. This is the default order that comes off the instrument, but some of you might, for example, be given a BAM file, an alignment file, and have to convert it back to FASTQ; if you have to do this, remember that you need to read-name sort the output before you go into the realignment step.
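To illustrate why read-name order matters, here's a small sketch, my own illustration rather than part of the course pipeline, that walks a pair of gzipped FASTQ files in parallel and checks that the records stay synchronized by name before alignment; the file names are hypothetical.

```python
import gzip
from itertools import zip_longest

def read_names(path):
    """Yield the read name from each record of a gzipped FASTQ (header up to the first space)."""
    with gzip.open(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:                       # every FASTQ record is exactly four lines
                name = line[1:].split()[0]       # drop the leading '@' and any description
                yield name.split("/")[0]         # drop legacy /1 or /2 mate suffixes, if present

def check_pairing(r1_path, r2_path):
    """Confirm the two mate files are in the same read-name order, as aligners expect."""
    pairs = zip_longest(read_names(r1_path), read_names(r2_path))
    for n, (name1, name2) in enumerate(pairs, start=1):
        if name1 != name2:
            raise ValueError(f"Record {n}: '{name1}' in R1 does not match '{name2}' in R2")
    print("Read 1 and Read 2 files are name-synchronized.")

# Hypothetical file names, just for illustration
check_pairing("H3K4me3_R1.fastq.gz", "H3K4me3_R2.fastq.gz")
```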
Okay, so we've completed our ChIP-seq sequencing run, and I've given you some high-level details about how the run is done. How do we assess the quality of the resulting sequence files? There are really three main areas to focus on. The first is sequence quality, and Edmund will go through a tool called FastQC that allows us to assess that. The second is library quality: how well did the library sequence, and what metrics can we use to assess that? And finally IP quality, and I'll talk about the metrics we use there as well.

Sequencing quality can be measured in a number of different ways; a very handy tool is FastQC. It's very easy to run, and it provides a series of metrics for your experiment against a series of benchmarks, giving you an assessment of how well the sequencing run worked. You can also get this information from your sequence provider, whoever sequenced your ChIP-seq library; they'll be able to provide the overall mean quality scores and so on, and you should pay attention to those. Those quality scores are encoded in the Phred quality framework.

How do we assess overall library quality? The main way is to look at the diversity of the IP fragments, and by diversity I mean how well the library represents the genome. The way we do this is to look at the PCR duplicate rate: the number of PCR duplicates present in the library. PCR is used during the ChIP-seq workflow, after we've added our adapters, to amplify the fragments and generate sufficient material to move on to the sequencing step. We'll talk about how you calculate PCR duplicates once we go through the alignment module, but you will find a large range of PCR duplicate rates, anywhere from a few percent upwards. This slide shows a random set of libraries generated in my lab, ordered from left to right by increasing PCR duplicate rate, so you can see the range of duplicate rates. The way we use this is to ask: here are the three new libraries we've sequenced, how do they compare, in terms of their duplicate rate, to this distribution? If they're on the high end, that might be something we want to take a look at. So, for example, this library here would be one we would consider likely poor quality, or a fail, because it has a very high duplicate rate.

There are two types of PCR duplicates. The first type is a PCR duplicate generated during library construction. Remember that this kind of duplicate is not a result of the clonal amplification that happens on the flow cell surface; the cluster generation process is essentially clonally amplifying a single library fragment. It's the step before that, when we're making the library after the IP, where PCR duplicates are introduced. After alignment, we define duplicates as reads that have identical start and stop positions with respect to the reference. We'll get into the details in the hands-on component, but suffice it to say PCR duplicates are defined by identical start and stop positions relative to a reference sequence.
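Here's a minimal sketch of that definition in practice, assuming paired-end fragments have been reduced to (chromosome, start, end, strand) tuples; real pipelines use tools such as Picard MarkDuplicates or samtools markdup, so this is only to make the "identical start and stop" idea concrete.

```python
from collections import Counter

def duplicate_rate(fragments):
    """Fraction of aligned fragments that share identical coordinates with an earlier fragment.

    `fragments` is an iterable of (chrom, start, end, strand) tuples, i.e. the
    fragment endpoints inferred from a paired-end alignment.
    """
    counts = Counter(fragments)
    total = sum(counts.values())
    duplicates = total - len(counts)          # everything beyond the first copy of each key
    return duplicates / total if total else 0.0

fragments = [
    ("chr1", 10_000, 10_180, "+"),
    ("chr1", 10_000, 10_180, "+"),   # identical start/stop -> counted as a duplicate
    ("chr1", 10_002, 10_180, "+"),   # different start -> kept as a distinct fragment
    ("chr2", 55_000, 55_150, "-"),
]
print(f"duplicate rate: {duplicate_rate(fragments):.2f}")   # 1 of 4 fragments -> 0.25
```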
So why do we care about these duplicates? Clearly they can introduce experimental artifacts. But of course, they could also be a consequence of real ChIP-seq signal. For example, if a region of the genome is highly occupied for whatever our target is, let's say H3K4 trimethylation, we might actually get real biological duplicates: individual fragments of the genome that we've sampled more than once, pulled down in the IP more than once, that are truly independent molecules. These can and do occur in regions of high occupancy. Those are the good kind of duplicates, and in order to be quantitative in the measurement of that particular feature, we would need to take them into account. However, for most ChIP sequencing experiments we don't have a way of determining whether a given duplicate is a good duplicate, in other words a biological one, or a bad duplicate, in other words one that arose through PCR amplification.

There are methodologies you can use to tag the fragments during the IP: you can add a molecular barcode onto the end of the fragment. This can be done, but it introduces some additional challenges in the alignment process, which I'm happy to talk about. It allows you to do what's called UMI marking, for unique molecular index. Think about it this way: if you attached a unique barcode after you did your IP, before you added your adapters, that would give you a unique barcode for each individual fragment you've generated, and you could then use that UMI as a handle to disentangle PCR duplicates from biological duplicates. If you're working in single-cell space you'll know that UMIs are attached to fragments on the 10x platform, in single-cell RNA-seq for example, and are used to distinguish true biological duplicates from PCR duplicates. But for most ChIP-seq workflows this is not done, as it introduces a number of experimental challenges in building the libraries, as well as computational challenges, which I'm happy to discuss. The take-home, and I always like to give you a take-home thought: PCR duplicates are removed for downstream workflows. That's probably the workflow you'll follow, and certainly the one we'll follow in the module, but it really depends, and it is something you need to think about as you work on your own data.

Okay, so that's library diversity; we're really looking at PCR duplicates as a surrogate for library diversity. What is the other quality metric we can use? That is IP quality: how do we know that the IP itself enriched for features in the genome? There are two common ways we do this. One is called a FRiP score. FRiP is the fraction of reads in peaks: the fraction of sequence alignments that fall within called peaks. We're going to use MACS2 to call enriched regions in the genome, as we'll get into.
Once we have those enriched regions, we can simply ask how many of the original sequence reads actually align within them. The higher the fraction of reads aligning within these enriched regions, or peaks, the better the quality of the data, or the more signal over background you have for that particular library.

Another common metric is domain reads. This requires a little foreknowledge of the expected signal signature, the expected occupancy, for the particular mark you're looking at. For example, if you're looking at H3K27 acetylation, you can take a set of enhancer regions that have been defined by, say, the ENCODE consortium, and then compute how many of your enriched regions overlap with known enhancers; obviously, the greater the overlap, the better the quality of that particular IP.

Like all of these quality metrics, these scores are widely distributed, and I'm giving you this example so you can get a feel for how widely. These are data sets generated as part of the IHEC consortium, showing FRiP scores for a particular mark, H3K27 acetylation. You can see that they range anywhere from, on the low end, maybe 10% or even below, all the way up to over 75%. There are many experimental reasons for this: some of these cell types are very rare and much harder to work with than others, and I would imagine that most of the samples on the right-hand side are from cell lines, which are much easier to work with and generally have higher IP signal-to-noise. All of these data sets will generate signal that we'll be able to detect using MACS2, but if you're comparing data with a very high FRiP score to data with a very low FRiP score, you'll need to take into account the differences in the false negative and false positive rates of those samples. Okay, so those are the ChIP-seq quality scores, and those are the ones we'll be computing as part of the module, so we can get into more detail as we get there.
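As a concrete illustration of the FRiP idea, here's a minimal sketch that counts how many aligned reads fall inside a set of peak intervals. In practice this is computed from BAM and BED files with tools such as bedtools or deepTools, so this is just the underlying logic on toy data, and all of the names here are made up.

```python
from bisect import bisect_right

def frip(reads, peaks):
    """Fraction of reads whose position falls inside any peak on the same chromosome.

    reads: list of (chrom, position) tuples, e.g. the 5' ends of aligned reads
    peaks: dict mapping chrom -> sorted, non-overlapping (start, end) intervals
    """
    starts = {c: [s for s, _ in ivals] for c, ivals in peaks.items()}
    in_peaks = 0
    for chrom, pos in reads:
        ivals = peaks.get(chrom, [])
        i = bisect_right(starts.get(chrom, []), pos) - 1   # nearest peak starting at or before pos
        if i >= 0 and ivals[i][0] <= pos < ivals[i][1]:
            in_peaks += 1
    return in_peaks / len(reads) if reads else 0.0

peaks = {"chr1": [(1_000, 1_500), (5_000, 5_800)]}
reads = [("chr1", 1_200), ("chr1", 1_499), ("chr1", 3_000), ("chr2", 700)]
print(f"FRiP = {frip(reads, peaks):.2f}")                  # 2 of 4 reads in peaks -> 0.50
```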
And finally, ATAC-seq measurements: what kinds of things can we use to look at the quality of an ATAC-seq data set? There are two main areas. One is to look at the fragment profile: if we align the reads to the genome and look at the lengths of the fragments that were generated, we expect to see a pattern that mirrors nucleosome content, so sub-nucleosomal fragments, mononucleosome, dinucleosome, and so on. Of course the frequencies drop as we get to the larger fragment sizes, because those are limited by the sequencing platform. If we zoom into the sub-nucleosomal and mononucleosome regime, around 150 to 160 nucleotides and below, we can actually start to see the DNA helical pitch: a stutter in the fragment length distribution that you can compute from the alignments, and this stutter is an indication of a high quality ATAC-seq library.

The other area uses a similar strategy to domain reads: we look at where in the genome the alignments are enriched. We expect ATAC-seq reads to align in regions such as promoters, enhancers, and so on, so the enrichment of the signal within known feature sets is also a way of assessing the quality of an ATAC-seq library.

Okay, finally, I just wanted to talk about one last thing, and this applies to both ATAC-seq and ChIP-seq: the so-called blacklisted regions. This is a project that Anshul Kundaje at Stanford really started within the ENCODE consortium, and it is a filter we'll be using as part of the workflow. So what is this? Well, it turns out that in the reference genome, and this is certainly true of the human genome and of most mammalian genomes, the assemblies are not complete, as I think we know, certainly for hg38. There are regions that are represented only once in the assembly you're aligning to, but that are actually present many times in the genome you've sequenced: a repetitive region that is, say, present a thousand times in the sequenced genome but only once or twice, or even less, in the reference. What these look like are very large read stacks, pileups that occur in regions of the genome and that are not related to signal intensity but rather to an alignment artifact, because that sequence is assembled or represented only a single time in the reference. We need to remove these because they are actually quite prominent: they can cover up to a few percent of the genome, and of course when we're doing a differential analysis we may be working in that same regime of just a few percent. This slide shows the extent for several genomes, and it can be quite prominent; in the mouse it's almost 7% of the genome. If we compute correlations between a set of data sets without filtering the peaks, we can see that a lot of the correlation is not driven by biology but rather by enrichment in these blacklisted regions; once we remove them, we start to see the biological signal, with much more meaningful correlation values. Where are these regions in the genome? As I said, they tend to be in regions of misassembly, gaps, satellite repeats, and so on. So best practice is to filter against the blacklisted regions, to remove them prior to your analysis, and we'll be doing that as part of the workflow.
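To make that filtering step concrete, here's a minimal sketch that drops peaks overlapping a blacklist, assuming both are simple (chrom, start, end) intervals as you would read from BED-like files. In the actual workflow this is typically done with bedtools intersect against the published ENCODE blacklist, so the code below is just the underlying logic on toy data.

```python
def overlaps(a, b):
    """True if two half-open (start, end) intervals on the same chromosome overlap."""
    return a[0] < b[1] and b[0] < a[1]

def filter_blacklisted(peaks, blacklist):
    """Return only the peaks that do not overlap any blacklisted region.

    peaks, blacklist: lists of (chrom, start, end) tuples, as parsed from BED files.
    """
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in peaks:
        bad = any(overlaps((start, end), region) for region in by_chrom.get(chrom, []))
        if not bad:
            kept.append((chrom, start, end))
    return kept

blacklist = [("chr1", 120_000, 140_000)]              # e.g. a repeat-driven read stack
peaks = [("chr1", 100_000, 101_000),                   # kept
         ("chr1", 125_000, 126_000),                   # overlaps the blacklist -> removed
         ("chr2", 50_000, 51_000)]                     # kept
print(filter_blacklisted(peaks, blacklist))
```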