So, my name is Yusuf. I'm from Toronto, and I work at the Ontario Institute for Cancer Research. When I started, I worked on the sequencing production team, where I was responsible for building a wide variety of pipelines, one of which was for RNA sequencing. I did that for three years, and then I moved to another team, where I now do more analysis, biomarker discovery, and algorithm development. At the same time, I still come up with QC tests and metrics that we can use to assess the performance of the libraries and samples we work with.

So today we're going to be continuing with the RNA-seq lectures. In this module, I'll focus on the alignment aspect of RNA-seq, and also on visualization, how to visualize those aligned files. The objectives for this module: first, go over the challenges we face when it comes to RNA-seq alignment. Then go over the different alignment strategies that are available and which one suits your data. We'll also go over the tools we've picked for the assignments and the tutorial, which is a combination of Bowtie and TopHat, and talk about other alternatives. I'll go over how the BAM files, the alignment files, are formatted, for those of you who have not seen a BAM file before, and how you can manipulate BAM files as well. We'll also go over RNA-seq visualization through IGV. Another section I'll be doing is QC assessment. Some of the questions you asked in the first module, how much sequencing do we need, how can we tell if a library is good or bad, I'll cover in more detail there. I'll show you what metrics you need to be aware of and how to do a proper assessment. And finally, a small section on BAM files and read counts: if you're trying to look for variants in RNA-seq data, I'll show you how to do that as well.

So what are some of the challenges that RNA-seq data analysis faces? One of the biggest is computational cost. As sequencing technology advances, we're getting sequencers that give us more reads and more depth, and the reads are getting longer too, and with that comes computational cost. As was mentioned earlier, one lane of sequencing on the HiSeq 2000 is about 300 million reads. I've actually seen more now; with the HiSeq 2500 I've seen up to 500 million reads per lane. That's a lot of reads to process. Another layer of complexity compared to DNA is that RNA has had its introns spliced out, so there's the challenge of trying to identify the intron-exon splice sites and exon-exon boundaries, which you don't have with DNA sequencing.

[A participant pointed out that the RNA reads themselves don't contain intron sequence.] Exactly, the RNA sequences do not contain the introns. But that's precisely the challenge you face when you try to align them to the genome, and we'll go over that. And in addition, RNA-seq is more than just expression data. You can also call variants from RNA-seq, you can look at splice variants, you can do fusion detection.
Each one of these different data types comes with a package, or a list of packages and tools, that gets updated often. So you can't just pick one tool that will do everything for you; you'll have to use a variety of tools to get all that information. Throughout the talk you'll also see links to Biostars, which was introduced to you earlier today. Here, for example, the question is: is TopHat the only mapper worth considering for RNA-seq data? I'll go over some other options as well, but if you go there, you'll see a discussion of what other aligners are available in the community.

When it comes to alignment, there are three main strategies you can choose from: de novo assembly, aligning to a transcriptome reference, or aligning to a whole genome reference. Now, how do you pick? De novo assembly is good if you do not have a reference genome. So if you're dealing with a species that doesn't have a reference genome, you go with that option. It's also a good choice if you have complex polymorphisms and mutations and you think that if you aligned to a reference genome you might miss them completely. Unfortunately, we're not going to be doing any de novo assembly in this course; instead, we're going to focus on alignment to a whole genome reference. The other option is to align to a transcriptome reference: you assemble all the possible isoforms into one reference, and then you align your reads to that. That usually works well if you have short reads. With whole genome alignment, you generally want longer reads. Usually what we do now is paired-end 2 x 100; that's what people do nowadays. Older data sets used shorter reads. I've seen 2 x 30, and I think that's what we'll be using in the assignment later today, just for the sake of processing speed. You might also see 2 x 50 for older data sets, but most of the recent data sets you'll encounter are at least 2 x 100.

This is a list of aligners on a timeline of when they were released. In terms of RNA, there are a few aligners available. I personally have only worked with TopHat. TopHat is very established: it was published in 2009, it's been used in many publications, and it's very well documented. The fact that it's been in use for such a long time means that for a lot of the questions you might be thinking of, you'll probably find an answer on Biostars or on the TopHat website. There will almost always be an answer to your question. The other aligner, which is pretty recent, is STAR. It's known to be a lot faster than TopHat. I personally haven't done any comparison to check the output and see the concordance between the splice junctions generated by TopHat versus STAR, and I'm not aware of any papers that do that. But for this workshop, we're going to be using Bowtie and TopHat. We'll also have STAR as an option; you can run it if we have time.

Now, back to the question that was brought up earlier about introns. Introns are a challenge because we're trying to map reads that do not contain introns to a whole genome reference, and the whole genome does contain introns.
So if you're aligning to the whole genome, you need to pick a tool that is splice-junction aware, and TopHat is a splice-junction-aware mapper. It uses Bowtie as its main aligner. What it does is take the reads and try to map them using Bowtie, and then TopHat takes the mapping information from Bowtie and tries to assemble those reads to come up with splice junctions.

I'm going to walk you through an example of how this is done. Let's say you have two reads, read X and read Y, and this is the reference. In the reference we're looking at two exons, exon 1 and exon 2. Let's say read X spans the two exons, while read Y falls entirely within one exon, and they're shown on the same line. The first step in Bowtie/TopHat is alignment: it takes all the reads in your data and tries to align them to the whole genome reference. In this case, because read Y falls within an exon, it aligns perfectly with no trouble, so it goes into the aligned bin. Then it tries to align read X, but read X would align to exon 2 with many mismatches because of the intronic region in the whole genome. So it struggles with it and puts it into an unaligned bin. It then takes all the reads in the unaligned bin and splits them into smaller segments. If your read was 100 bases, the default segment size is 25 bases, so it usually splits it into four 25-base segments. Then it takes each one of these segments and does the same thing: Bowtie, which is a short-read aligner, maps each one back to the whole genome reference. In this case, x1 and x3 will map perfectly, while x2 won't, because it's sitting right on the splice junction. TopHat then collects all the mapping information from x1, x3, and x2, and uses it to build a dictionary of all the possible splice junction sites in your data. Then it goes back and tries to realign your original reads according to that dictionary.

[Question: why does it need to split the read at all?] Because if you did not break the read into smaller segments, you'd end up with half the read matching and the other half completely mismatched; the aligner would just report that many bases as inserted or mismatched. The point of this exercise is to figure out where the boundary is, where exactly the splice happens. [And you're aligning the segments against what, the reference genome?] The whole genome. Now, with very short segments you have to be careful: the shorter the segment, the less unique the alignment will be, and if a segment maps to multiple places that's a problem, so you don't want to go too short. [And the segments that span the splice, they'll never align to anything?] Yes, because of the splice. But what TopHat does is go back and realign all the original reads.
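To make that concrete, here is a minimal sketch of a TopHat run; the index prefix and FASTQ names are placeholders I made up, and the option values shown are just the documented defaults:

```bash
# Sketch of a paired-end TopHat run against a Bowtie index of the whole genome.
# "hg19" and the FASTQ names are assumed placeholders, not course files.
#   -o                output directory
#   -r                expected inner distance between mate pairs
#   --segment-length  size of the pieces unmapped reads are split into (default 25)
tophat -o tophat_out -r 50 --segment-length 25 \
    hg19 sample_1.fastq sample_2.fastq
```

The --segment-length option controls exactly the splitting step just described.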
So the purpose of the splitting is just to detect where the splice happens; TopHat then goes back, takes the original read, and maps it against the dictionary it came up with. The mapping information for those small segments is not used in the final alignment; it's just a way to detect where the splice is happening.

[Sorry, could you repeat the question? You were asking about a small exon, 50 bases or less, sitting between two others, with alternative splicing: three exons, one, two, and three.] The same thing happens: your read has to be long enough to cover the two splice sites, and if it doesn't, you'll miss it. And keep in mind you'll have a lot of reads, not just one, going over that site; the whole region should be covered with enough reads to cover the two junctions. If you're not covering those two, then either you should go back and check whether there was a technical problem and you haven't covered that sequence, or maybe that isoform is simply not expressed, and that's why you're not seeing it. This is just an example; in real data you'll have long transcripts. On average there are something like four or five exons per gene, and you'll get different combinations of exons depending on the isoform being expressed: sometimes it's one, three, four; sometimes it's one, four, five; and so on. Any more questions?

All right, one other issue you might face is multi-mapped reads. These are reads that map to multiple locations. With DNA sequencing, a lot of the time you end up picking the alignment with the highest mapping quality and keeping just that. But sometimes you end up with multiple alignments that have the same exact quality, and then you have different options for dealing with that; you might randomly pick one of those alignments and use it. In RNA-seq analysis you face the same issue, but the way you deal with it can be slightly different; it really depends on what you're doing. If you're trying to do variant calling from RNA-seq, it's preferred to keep only one of those alignments. But if you're doing expression analysis or fusion detection, it's good to keep all of them, because they could belong to different gene homologues in your genome, and you might distort the expression estimates if you get rid of those reads.

After you align, you end up with a SAM file, which stands for Sequence Alignment/Map format. How many people have actually seen or worked with a SAM file? Okay. Good, good. The BAM is just the binary version of the SAM file. To convert from SAM to BAM you can use samtools, and that will save you a lot of space; it reduces the footprint of the file. You can always go back to SAM from your BAM file if you want. [Question: is the SAM format different between DNA and RNA?] Same format, same files.
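For reference, the conversion in both directions looks like this (file names are placeholders):

```bash
# SAM -> BAM: -b asks samtools view for binary (BAM) output.
samtools view -b -o sample.bam sample.sam
# BAM -> SAM: -h keeps the header so the text file stays self-describing.
samtools view -h -o sample.sam sample.bam
```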
TopHat does also output other files that tell you where the splice junctions are, and I believe there are insertion and deletion files as well, plus some secondary output files. But the main file you're interested in, the alignment file, is exactly the same format as for DNA.

This is an example of a SAM file for those of you who haven't seen one. It's split into two sections: the header section and the body. The header section contains information about the run and the sample, and the body contains the actual sequences that were aligned, along with their positions and qualities. Since many of you know SAM, I'm not going to cover every field, but I'll go over the header. This is the kind of information you'll see there: the tool, in this example TopHat, and the version number you used to align; the reference file you used for the alignment and the list of chromosomes, all the contigs; and the read group identifier, which contains things like the name of the sequencing center and the sample name, and then the program version. In the body, you'll see the fields I mentioned: the read name; then, highlighted here, the flag and the CIGAR, which I'll go into in depth in a bit; the position; and the mapping quality. You get a mapping quality for the read itself, and you also get a breakdown of the base quality, the quality call for each one of the bases. You also get the length, which in this case would be 101, and the actual sequence that was aligned.

One of the items I highlighted was the SAM, or BAM, flag. This is an 11-bit bitwise flag that describes the alignment, and each read gets one. It's just a binary string of length 11, and it's a nice trick: instead of having 11 columns in your file, it's condensed into this one number. It keeps information about things like whether the read was unique or mapped to multiple places, whether it's the first or the second read of a paired-end run, whether it failed quality control, and whether it's a PCR artifact. How many of you have tried to interpret or use SAM flags before? No one? Okay, so you'll get the chance now. If you check this website, it gives you a form where you describe what you want for your read, and it tells you the flag number. Why is this useful? Let's say you're looking at a BAM file and you want to pull out only the reads that are uniquely mapped. You can use samtools view, give it this flag, and it will go look through all the reads and pull the ones that have that flag number. So can we all try going to this page right now? Let's say I'm interested in picking the reads that are mapped properly, and I want them to be read one, the first read in the pair. What would that number be? So the two criteria I want from my BAM file: the read is mapped properly, and it's the first segment. 66. Okay, good.
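Besides the web form, newer samtools releases can decode the number for you; a small sketch (the BAM name is a placeholder):

```bash
# Decode the flag: 66 = 2 (properly paired) + 64 (first read in the pair).
samtools flags 66
# Keep only reads carrying both bits, writing a smaller BAM.
samtools view -b -f 66 -o read1_proper.bam sample.bam
```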
If you're not taking this down directly, the link can be found in the module lecture notes. [Question about the number] So the 66 represents those criteria in binary. You can give that number to samtools: you say samtools view, dash f, put that number, and put the name of the BAM file, and it will pull only the reads that match those criteria, reads that are properly paired and are read one. This form I'm pointing at helps you: instead of working out the number yourself, you check the boxes for the criteria you're interested in and it calculates the number for you. Then you can take that number and, as I mentioned, use it with samtools view. There are two relevant parameters in samtools view, dash small f and dash capital F: one includes, so it pulls all the reads that fit the criteria, and the other excludes reads matching the criteria. You can use that to create a subset, a smaller BAM file if you want, that only includes reads with the criteria you specify. One example: you want to filter the BAM file after alignment and do some downstream analysis with it. You can take, say, just the uniquely mapped reads, pass the file through this filter, get the filtered BAM, and move on and do other things with it.

The CIGAR string is another thing you'll find in the BAM file. [Checking the form] It's 66, I think, right? Here it is: I wanted properly paired, and I wanted read one, and the flag I got was 66, so I can take that number and run it through samtools to subset my BAM file. [Question about the PCR duplicate flag] I haven't tried it, so I don't know how accurate it is at detecting PCR artifacts during alignment. That's correct, yeah.

So, the CIGAR string. This string tells you how the read aligned: it essentially describes the alignment per base. For each stretch of your read, it tells you whether it matched or mismatched, whether there was an insertion or deletion, or whether there was a gap in the alignment. Take this example, and say your read was 100 bases. The CIGAR is saying that the first 81 bases of your read matched, then there was a gap, a skip of 859 bases in the reference, and then the last 19 bases matched. This is important because the CIGAR string gets used by a lot of the tools you pass your BAM file to. One example I can think of is GATK; not for RNA, but for DNA, it looks for indels and does local realignment around them, and one place it looks for potential indels is this CIGAR string; it parses the CIGAR for them. Other tools also take advantage of the CIGAR string to find patterns of insertion or deletion.
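If you want to poke at CIGAR strings yourself, a quick way is to pull column 6 out of the BAM; this little pipeline (file name assumed) tallies the most common CIGARs and then grabs a few spliced alignments, which carry the N operator:

```bash
# CIGAR is column 6 of the SAM text; tally the most frequent patterns.
samtools view sample.bam | cut -f6 | sort | uniq -c | sort -rn | head
# Spliced reads have an N (reference skip), e.g. 81M859N19M as on the slide.
samtools view sample.bam | awk '$6 ~ /N/' | head -3
```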
[Question: what does the N at the top right mean? And the brackets say M is a match or mismatch; I thought it was just a match.] I'm pretty sure M should just be a match. But then the question is, how is N, or M, different from the equals sign? The spec says something like alignment match versus sequence match. With a sequence match, maybe it checks the original sequence; I don't know whether it checks against the FASTQ sequence you passed in to make sure the base is the same, I'm not quite sure. But the aligner will sometimes align bases even when they're not a perfect match, so the 81M means it aligned those 81 bases even though some of them may not be exact matches. [And in the skipped region, there would be dashes, right?] Yes, there would be different bases there, so it would show as dashes. [Is there a threshold for how big the gap can be?] A threshold; I think you can specify the number of mismatches when you run the alignment, but in terms of the gap, it really depends on the aligner. With TopHat you actually provide an expected distance between the mates, so the question is whether it looks at that distance and uses anything beyond it as a criterion when building the CIGAR string; I don't know. But there is a parameter for the number of mismatches. That's the question we were discussing here.

Sometimes you don't want to look at the whole BAM file. The BAM file contains information for all chromosomes, the whole transcriptome of data, but maybe you're interested in a specific exon or a specific gene. What you end up doing is using what's called a BED file, and you can use that BED file to subset your BAM file. Let's say you're only interested in one gene, and you want to create a subset, or you want to look at the coverage across that gene, or across the exons in that gene. What you do is create a little text file called a BED file, which contains the chromosome of the region you're interested in, the start position, and the end position. It doesn't have to be a gene; it can be any region in the genome that has a chromosome, start, and end. Tools like samtools accept a BED file so you can look at the coverage across the regions you specified. And there are also tools like BEDTools and BEDOPS that can be used to manipulate BED files directly. If you have two BED files, you can look at the overlap between them. Say you're doing targeted sequencing using two capture methods and you want to see what the overlap between the two target sets is: you get one BED file from one target and another BED file from the other, and you can use BEDTools or BEDOPS to calculate the overlap, or come up with the intersection, or the regions unique to one versus the other. There's a lot you can do with BED files using these tools; a small sketch follows.
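Here's that sketch; the coordinates and file names are made-up placeholders:

```bash
# A minimal three-column BED file: chromosome, start (0-based), end.
printf 'chr1\t11873\t14409\n' > regions.bed
# Subset a BAM to reads overlapping those regions (-L takes a BED file).
samtools view -b -L regions.bed sample.bam > subset.bam
# Overlap between two capture-target BED files with BEDTools.
bedtools intersect -a targetA.bed -b targetB.bed > shared_targets.bed
```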
After the alignment, the first thing you usually do is sort the BAM file, and there are two ways to sort: by coordinate or by read name. Depending on what you're doing after the alignment, sometimes you need to sort by position: if you're doing processing with another tool, it may require coordinate sorting because that makes it easier to grab reads by position. However, if you're trying to look at something like the distance between two mates, which requires the paired reads to be in the same order, then you need to sort by read name instead; if you sort by position you might lose the pairing order of read one and read two, whereas if you sort by read name, the fifth line of the read-one BAM will match the fifth line of the read-two BAM.

For visualization you can use IGV: you load the UCSC gene track, load your BAM file, and the coverage track on the top tells you how many reads cover each region; you can log-transform that track. The gray bars represent reads, and thin lines represent gaps in the alignment. You can also change the color of the reads to reflect their quality; that's something you can do. Again, there are a couple of Biostars discussions about alternatives to IGV, with a list of tools you can use. I use IGV not so much to assess the quality of the BAM but, for example, when we're interested in a gene that's differentially expressed: when we get a list of differentially expressed genes, the first thing we do is go back to IGV just to make sure the regions around those genes look okay and the signal wasn't picked up due to a technical error. And you can actually use two panels: if you're looking at tumor and normal, you can put the two panels right above each other and do a direct comparison of the coverage and the expression between the two samples in those regions.

Before I get into the QC assessment, any questions so far? No? I hope that means everything is good. So the second part of this module is the QC assessment. Here I'm going to talk about the main things we look at to decide whether a library needs to be re-sequenced or sent back to the lab. I'll talk about 5'/3' bias and how it affects the analysis; the nucleotide content; read quality and base quality distributions; PCR artifacts and how to deal with them; sequencing depth; the base distribution; and the insert size distribution.

One of the first things I check when I look at an RNA-seq library is the distribution of coverage across the genes. These plots are generated by a tool called RSeQC; I highly recommend it, it's really easy to use. What it does is take a BED file of your genes, sorry, of the top 1000 expressed genes in your file. Those genes will have different lengths, of course, so it normalizes each one by splitting it into 100 bins from the 5' end to the 3' end, and then looks at the coverage distribution across those bins. The assumption is that the top 1000 expressed genes will give you a good representation of the overall coverage distribution for the library; a sketch of running it is below.
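Assuming the tool is RSeQC's geneBody_coverage.py, which matches this description, a run might look like the following; the gene model BED and file names are placeholders, and the BAM has to be coordinate-sorted and indexed first (the same sort and index you need for IGV):

```bash
# Coordinate-sort and index; IGV and most downstream tools expect this.
samtools sort -o sample.sorted.bam sample.bam
samtools index sample.sorted.bam
# (use "samtools sort -n" instead when a tool wants read-name order)
# Gene body coverage from 5' to 3' across the most-expressed genes (RSeQC).
geneBody_coverage.py -r hg19.RefSeq.bed -i sample.sorted.bam -o sample_qc
```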
So here's what we're seeing. On the x-axis you have the gene body position from 5' to 3', and on the y-axis you have the coverage. These samples over here seem to be performing well; there's even coverage across the transcript. However, these other samples seem to have a huge 3' bias: there is much more coverage at the 3' end of the transcript than at the other end. This is something you have to be very careful with. Do I normalize or correct for it? Should we send it back and check? It could be caused by the kit used to prepare the library, so you might consider changing the kit, but you have to look into it; correcting for it is not the easiest thing to do, so it's better to go back to the root and figure out what caused it. If you keep it, it will for sure influence the expression estimates in your data: very long genes will be severely affected and their expression underestimated, while very short genes will have their expression overestimated, because of this bias. One other way to check for it: after you estimate expression, you can plot expression versus gene length. If you see a correlation between expression and gene length, especially a downward trend with length, that's another confirmation that there may be some sort of coverage bias. So that's the first thing I check: the coverage distribution.

[Question: could the two groups in the plot be caused by genuinely different expression?] No, because here I'm looking at the top thousand expressed genes, and the chance that a thousand genes all behave that way as one group doesn't make sense. Even if you were looking at a small subset, you shouldn't see uneven coverage along the transcript; if there were a biological difference, the expression of the whole transcript should change, not trend in one direction. So this is a mapping or library problem. [Question: what is the input for these plots?] I think they use GTF files, which you can download from UCSC; it's just a file that lists all the genes with their start and end. And I guess you can provide your own if you can't find one; you can make your own. Any other questions? Yes, the annotation file is used to annotate the genes, because the tool looks at the top thousand expressed genes; it's just a way to annotate the regions, but it actually looks at your data file, your BAM. [Do I upload the BAM somewhere?] Well, you're not uploading anything; you run it on your local machine. You need a BAM file and an annotation file; the annotation, as I said, you can download from UCSC. [And this is not included in the FastQC report?]
I don't believe so; maybe they should include it. But I think FastQC was designed to provide quality metrics at the FASTQ level, which does not involve an alignment; you can run FastQC after alignment, but that's not what it was designed for. And if you don't really want to use this tool, you can calculate it on your own: just look at the coverage for each gene at the two ends. And as I said, if you don't want to do it at that level, you can do it at the expression level by plotting expression against gene length.

[Question: after this analysis, can we drop a sample? At what point, as we go through our alignment, do we conclude that a particular sample is really messed up, a really bad library, and should be dropped?] So the question was at what point you decide whether or not to drop samples and not continue with them. It really depends on how severe it is. There isn't one number I use as a threshold when I assess the quality of a library; I don't go, oh, this number is more than 0.6, so I'm going to drop it completely. When I assess, I look at all the metrics that I mentioned and then make a judgment after looking at all of them. You could train a model if you want: look at all these metrics across everything you've done, what you've flagged as pass and fail, and come up with some sort of statistical model, training it so that every time you pass it a sample it makes the call for you. But for now, if you don't have that in place, you just look at it and assess how severe it is. This one, to me, is severe; but again, you look at multiple metrics, not just this one. And if you do a preliminary clustering or batch analysis, see whether there are batch effects, whether one batch stands out compared to the others. If it does, that tells you you should really consider doing something about it. And if you really, really can't do anything about it, then at least make people aware of it, so that when they include it in their data analysis they can incorporate it into their models and try to correct for it somehow.

The second metric I look at is the nucleotide distribution. After sequencing, you would expect a uniform distribution of the four bases, A, C, G, T; no base should be over-represented in your reads, at least overall. One of the plots we generate is nucleotide frequency by position in the read, from the first position to the last; in this case the reads are 35 bases long. You expect the four frequencies to be uniform, flat along the 25% level. But here you sometimes see a pattern where the A, C, G, T distributions are distorted for the first 10 or so bases, and that is actually expected; there's a paper showing that it's the random primers that are used that tend to cause these patterns at the 5' ends of the reads. One way to deal with it is to trim the first few bases of your reads, as in the sketch below, so it doesn't affect your analysis. So again: plot it, see if you actually have this issue, and if you do, trim the first few bases. And note that if you do have this problem, it will be systematic, not present in some samples and absent in others; if they were all prepared with random priming on the same platform, they should all have the issue.
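The lecture doesn't prescribe a trimming tool, but as one hedged example, cutadapt can clip a fixed number of leading bases from both mates (the 10 here is whatever your plot shows, and the file names are placeholders):

```bash
# Remove the first 10 bases of read 1 (-u) and of read 2 (-U).
cutadapt -u 10 -U 10 \
    -o trimmed_1.fastq -p trimmed_2.fastq \
    sample_1.fastq sample_2.fastq
```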
[Question: what's the impact of this on the alignment?] That's a good question; I don't know exactly, but it probably does affect your alignment: if those bases are systematically biased, they might increase the number of mismatches. By trimming them, it has helped; I've trimmed them and it improved things, but I don't have a number for by what percentage it enhanced the alignment. [Is it always about 6 bases of primer? I see this all the time.] It's something I now do routinely, but keep checking. It is a sequencing-level thing, so you'll see it in all the samples. Here, if you look, it goes out to about 10 bases, but I don't want to give you a fixed number, because it can differ depending on the technology or the sequencer you use. So first check, before you decide which bases to trim, whether you assess sample by sample or apply the same trim to all your samples automatically. If some samples don't have the problem and others do, I would be concerned, because it should be a sequencing-level thing, not a sample-level thing. [Is this the primer sequence itself?] No, the primer is not sequenced, so you're not actually reading it. It's the random primer that causes this issue, but these are not the primer sequences you see; it's affecting the distribution of the bases that get synthesized rather than being the actual primer. And it's a library preparation thing, not a sequencer thing; that's why the random priming matters. [Isn't this just GC content?] Sometimes you look at GC content, but that's the overall content; here we're looking at positions at the beginning of the read, not the overall content, and it's systematic: it always happens at the start of the read.

Next you can look at the quality distribution; here we're looking at the distribution of quality per base, the base call quality. Again, on the x-axis you have the position, bases 1 to 35 for this 35-base read, and on the y-axis you have what's called the Phred quality score. You'll run into this often when you deal with quality scores and alignment files. The Phred quality is simply negative 10 times log10 of p, where p is the probability that the base call is wrong. So, for example, a Phred score of 30 means there is a 1-in-1,000 chance that the base call is wrong. The higher the Phred score, the better: the lower the chance that your base call is wrong. Usually, when people assess quality, they say anything above 30 is considered good, so you look for reads with an overall quality of Q30 and up. Looking at the distribution here, the quality is very high overall, everything is above Q30, which is good, but you see a slight drop toward the end, and that's completely normal: that's just the way sequencing works, it accumulates error as you go, so by the end of the read the error profile is a bit higher. You shouldn't, however, see a huge drop below Q30, because that would be a very high error rate at the end of the read, and you don't want that. If it is really, really low at the end of the read, then consider trimming the read at the end as well, so you only keep the part of the read that has high quality. Most of the recent Illumina platforms, like the HiSeq, report quality on this Phred scale; a quick worked example follows.
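That worked example, straight from the formula Q = -10 log10(p):

```bash
# Phred score to error probability: Q30 -> p = 0.001, i.e. 1 wrong call in 1000.
awk 'BEGIN { q = 30; p = 10^(-q/10); printf "Q%d -> p = %g (1 in %d)\n", q, p, 1/p }'
```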
[Question: in case we don't know the platform, is there a way to find out the quality encoding of a particular data set, whether it's Phred+33 or Phred+64?] I think for that you have to look at the distribution of your quality characters, look at the minimum and the maximum, and see which encoding the range matches. But also, a lot of tools nowadays are able to recognize the encoding automatically: they look at the distribution and then tell you, they predict it.

We talked earlier in the day about PCR duplication and whether or not you should collapse duplicates. One of the metrics we look at is the number of reads per start position. Ideally you want one read per start: we take all the reads, count the reads at each specific start point, and see how many we have. What you don't want is reads that pile up on the exact same start and end; you want reads that actually overlap across the region. However, in RNA-seq the starts follow the transcript starts, so it's not completely random like DNA. The tool does generate a plot for you that checks both the position, from the beginning of the first read to the end of the second read, and also the sequence itself, checking whether the sequence is exactly the same, and then plots it. That way you can visualize PCR or transcription artifacts, and based on that it's up to you whether you want to collapse duplicates or not; but if you see something extreme, you might want to consider it.

[Question about what the axes show] Sorry, so this is the duplication level: how many times you see reads that are exactly the same. You take a read, and ideally most reads are seen only once, but here, for example, you see some that occur 300 times. Let me check how it was generated; here it is. The first column is the occurrence, the duplication times: how many times that exact sequence, with the same start and end, was duplicated. And there are two y-axes: they tell you the number of reads out of your total. The reads seen just once are unique reads, and that's what you want: you want to maximize the ones at low duplication levels and keep this tail really low, minimizing the number of reads that are duplicated many times. When you get a bad library, you'll see something like this, where at high occurrence, at high duplication levels, you still see a lot of reads. Yes, the duplication rate is measured in two ways: they look at the position for the mapping-based duplication, and at the sequence for the sequence-based duplication; the sketch below shows a typical duplicate-flagging workflow.
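The speaker doesn't name a duplicate-marking tool here; for reference, one common workflow with modern samtools looks like this sketch (file names assumed, and remember that collapsing RNA-seq duplicates is a judgment call):

```bash
samtools sort -n -o namesort.bam sample.bam    # name order, required by fixmate
samtools fixmate -m namesort.bam fixmate.bam   # add mate-score tags for markdup
samtools sort -o possort.bam fixmate.bam       # back to coordinate order
samtools markdup possort.bam markdup.bam       # flag duplicate reads
samtools flagstat markdup.bam                  # summary includes the duplicate count
```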
That is, they check whether the sequence is exactly the same, or just the mapping, the beginning and end of the read, because you could have mismatches introduced. Usually the two are correlated; the mapping-based and the sequence-based rates track each other. [Reading the plot: this is saying that some percentage of the reads are at about one occurrence?] I think it's a cumulative rate; you'd have to add them up to 100%. Yes.

Expectation is a big factor here, because we have the expectation that we're dealing with human data. I am dealing with Illumina RNA-seq data of a certain length, 2 x 100, aligned against a whole genome reference. All of these plots are based on that scenario, and they will change if you use a different one: a transcriptome reference, say, or a de novo assembly. Yes?

[Question about cut-offs] Again, I don't want to give you cut-offs. I don't personally apply one, especially with this plot, because we don't collapse duplicates in RNA-seq data, but I put it up to visualize the impact. If it looks really bad, go back and investigate, and ask why this is different from the other samples we've seen, how come it's much higher than everything else we've done. But I don't apply hard cut-offs. [What's a typical number?] I guess I don't know, but it usually looks like this, and I've seen some bad samples that should have been eliminated. I don't have that number off the top of my head, sorry. And again, I can't really distinguish whether it's PCR duplication or genuine expression, so I can't give you a specific number; that will depend heavily on the region of RNA, on the gene. What time is the break? 15 minutes? Okay, let's see.

Sequencing depth. This is another topic we discussed earlier in the day, and I get this question a lot: have we sequenced enough? How much sequencing do we need? How many reads? How many lanes? Just to add to the discussion we had earlier, one way I look at it is with saturation plots. What is a saturation plot? We take the BAM file after alignment, and then we randomly sample it at different levels: 10% of the reads, 20%, 40%, 60%, 80%, 100%. In each random sample, we count the number of splice junctions, known splice junctions and novel splice junctions, and we plot those points. What you expect to see is that at some level, the reads you're adding are not really adding much information: you get to a point where, no matter how many reads you add, you've already discovered all the splice junctions that are known. Now, the growth rates of the curves may differ between the known junctions and the novel junctions, because novel junctions can include false positives from the splice detection tool we use, so they might not saturate as fast as the known junctions. And that's okay, because, again, there is a meaningful false positive rate in novel splice junctions. But if you look at the known junctions here, you look at where the curve saturates. On this slide I haven't put the number of reads at the 100% point, but usually when you look at the saturation you can say, okay, maybe around 40% or 50% was sufficient. The sketch below shows one way to generate these numbers.
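Two ways to get there, sketched under the assumption that RSeQC's junction_saturation.py is an acceptable stand-in for the plot shown (the gene model BED and file names are placeholders):

```bash
# RSeQC resamples the BAM at increasing percentages and counts known/novel junctions.
junction_saturation.py -i sample.bam -r hg19.RefSeq.bed -o sample_sat
# Or subsample by hand: -s takes seed.fraction, here 40% of reads with seed 0.
samtools view -b -s 0.4 sample.bam > sample_40pct.bam
```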
So you can do a pilot study where you do one, two, three lanes of sequencing for one sample, do this, and see: did I do too much? Did I do too little? That's one way of assessing how much to sequence. But again, this is based on splice junctions; if you're sequencing for a completely different purpose, like variant calling on RNA-seq data, then you might just look at the number of reads covering the sites you're interested in, and you might not want to use this method.

Base distribution. Here I'm looking at the number, or percentage, of bases that fall in coding regions, non-coding regions, and so on. That's important because the different library preparation techniques you use will yield different distributions. For example, with whole-transcriptome libraries you expect the distribution to include more non-coding signal compared to poly-A libraries; with poly-A you expect the coding fraction to be a lot higher than with whole-transcriptome. If those proportions look wrong when you check the QC report, you might want to go back and check that the library preparation was done properly.

One set of terms I want to introduce, which will come up in the integrated assignment tonight, is fragment versus insert versus inner mate distance. These terms are used routinely and can be confusing, because sometimes they're actually misused. When I talk about the fragment, I'm talking about the span from the beginning of read 1 to the end of read 2 including the adapters: the fragment size is the whole length. When I talk about insert size, I mean the distance from the beginning of read 1 to the end of read 2, not including the adapters. And the inner mate distance is the inner distance that separates the two reads. TopHat actually asks you to estimate the expected inner distance when you run it, and that information usually comes from the lab: the people who construct the library cut the band out of the gel, so they have an estimate of what the mean size is. After you do the alignment, you can look at the distribution of the insert size, or the fragment size, or the inner distance, depending on which you choose to plot; you look at the mean and then check whether it's consistent with what you expected from the library preparation.

And finally, this is just a snapshot from IGV; this is one way you can look at variants. There are a lot of tools you can use to call variants; I'm not aware of many designed for RNA-seq specifically, but the new version of GATK does allow you to call variants from RNA data. Another thing you can do is samtools mpileup. What that does is go to every single base in your data and tabulate the frequency of the bases at that position: it will tell you this many reads supported the reference and this many reads supported the alternate, if there is variation; there's a sketch below.
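A sketch of that pileup (the reference FASTA and file names are placeholders):

```bash
# One row per covered base: chrom, 1-based position, ref base, depth,
# read bases, base qualities.
samtools mpileup -f hg19.fa sample.bam | head -3
# Restrict to regions of interest instead of walking the whole genome.
samtools mpileup -f hg19.fa -l regions.bed sample.bam > pileup.txt
```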
You can use that to calculate the frequency. And if you know of certain sites where you're expecting to see variation, for example you've done a DNA-seq experiment, you already know the SNVs or variants in the DNA, and you want to check concordance in the RNA data, you take those variants, open IGV, go to the position of the variant, and look at the distribution of the bases. Here, for example, we're seeing that the reference is C, but there are reads that have a T; that's a variant present in the RNA. If you want to calculate the frequency, you simply divide the number of T's by the total number of reads covering that spot, and that tells you the mutation frequency at that specific site. But you can't do that in IGV for every base of your RNA data, so you might want to run it through some sort of caller first.

All right. I think in the next module Obi is going to take you through the hands-on part, everything we talked about with TopHat. So far you have generated the FASTQ files, and I think the last thing that was done was maybe FastQC. You'll take those FASTQ files and go through adapter trimming, alignment, and possibly some post-alignment QC, and that will be the end of today. Then I guess we'll go for dinner, and when we come back, I'm here from 6 to 8 working on the integrated assignment with you. The integrated assignment will be the same style as the material we did during the day, but with a completely different data set, and another chromosome. You'll get the chance to download the data and go through it yourself; I'll be around and can assist you with anything. We'll do the same thing: we'll do the alignment today and leave the expression for tomorrow. Thank you. Any further questions about the material, or anything else, before the break?

[Could you elaborate a little more on mpileup and what you actually get out of it?] Yeah. With mpileup, you provide a FASTA file, which is the sequence reference, and you provide your BAM file. Then it takes every base in your reference, goes to the BAM file, and tabulates the distribution of bases there. You end up with a file that has, for the human genome, about 3 billion rows, one base per row, and it gives you a breakdown of all the bases: it will say, at this base the reference is A, and I saw five A's and three G's and one C. Then I think you need to take that and summarize it: you figure out which positions are variant, calculate the frequencies, apply whatever frequency threshold, and select those; those are your variants. The problem with that is that it takes so much space, because you're doing it per sample. I don't know whether you can run mpileup on multiple samples at once; if you can, you could run all your samples together. But it eats a lot of space and takes a long time. You don't have to do it for the whole genome, though: you can provide a subset of the FASTA reference, or restrict it to regions, if you've captured something and only want the regions you care about. Sure. Yes.
[He was asking how come we see negative inner distances.] I wish I had a board; oh, there is one, excellent, perfect. Okay, let's assume the library fragment you're trying to sequence is 250 bases, and you're doing paired-end sequencing at 2 x 100, so 100 bases on each end. Then you expect the inner distance between the reads to be 50. Correct? Yes. Now, let's say the fragment was actually 200, as provided by the lab. You still do 2 x 100, because that's what you usually do, and you're going to end up with a distance of zero between the two reads. And sometimes the fragment size comes out even smaller; think of it as a distribution, so you will get some fragments shorter than the sum of your two read lengths. In that case, the reads will overlap, and the shorter the fragment, the bigger the overlap, until you actually get 100% overlap. That's where the negative inner distances come from. Exactly. And yes, you are losing something, because you're covering the same bases twice. Ideally you want the fragment size to be big enough that you have some inner distance and the reads don't overlap. I don't think it affects expression much, but I don't know how it affects variants: if a variant call falls in the overlap, you're not actually getting two independent reads. It's better proof that the variant exists if it's supported by reads coming from two separate fragments, versus the two overlapping mates of one fragment. So ideally, check first what the expected library size is, and then decide whether you want to do 2 x 100 or not. The problem nowadays is that you don't sequence just one sample, you multiplex multiple samples, and you usually let the instrument run 100 cycles for everyone, so you can't really stop it for individual samples. But if you have that capability and the fragments are short, don't do 2 x 100; the arithmetic below shows why.
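The arithmetic, as a tiny sketch:

```bash
# Inner distance = fragment length - 2 x read length; negative means the mates overlap.
awk 'BEGIN {
    readlen = 100
    for (frag = 250; frag >= 150; frag -= 50)
        printf "fragment %d with 2 x %d reads -> inner distance %d\n", frag, readlen, frag - 2 * readlen
}'
```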
[Question: could you go back over the saturation plots?] Sure, I'll go over it again. You take your BAM file after alignment, and you sample from it. Let's say your full BAM file, the 100%, has 100 million reads. You start sampling from it: you randomly sample at increasing percentages and generate new BAM files, so you have, say, 10 BAM files ranging from 10 million up to 100 million reads. For each one of these BAM files, you count the number of junctions you see, known or not: you check a splice junction database to see how many of those junctions have actually been reported, and that's what you plot. Ideally, you want to see some saturation, and that saturation is going to happen at some point, because past some depth the extra sequence adds no information: you've already detected everything that can be detected. Does that make sense? [Right, but you only see the saturation point if you start close to it.] That's a good question; if you start with a small number of reads, you're never going to see saturation. The point behind this is that you do a very deep pilot run, as I suggested, two or three lanes, before you start a project, and then you see approximately how many reads you need to saturate. You can take that number and then apply it to the rest of the sequencing you need to do. [For all the other samples?] Yes, for all the other samples; you don't want to sequence more lanes than you need, you don't want to spend too much. [Is the issue the number of lanes or the read depth?] Read depth is the number that matters; the more lanes you sequence, the more depth you get. [And the split in the plot?] Yes, there is a split. I'm plotting novel junctions, any new junctions that haven't been seen before; known junctions, the junctions that have been reported before; and all, which is the combination of the two, pretty much their sum. If you're really looking for novel junctions, then it could happen that you need a different amount of sequencing. For sure: if you're looking for novel junctions, you'll probably end up needing to sequence deeper than you would for known junctions. [So you might just use one or two samples to determine how much sequencing you need for your whole project?] Well, again, you might want to use two or three samples, to stay on the safe side. Okay, let's go on a break.