Welcome to the Introduction to Mapping. The questions we want to answer are: what is mapping and what is alignment, what is the BAM format, and how can we view aligned sequences? We expect that you have some background in using Galaxy, so if you do not, please go back and do the Introductory Material. By the end of this session, you should understand the concept of mapping, you should have learned about the factors that influence alignment, and you should know how to use a genome browser to better understand your particular aligned data.
So where does mapping fit into the next generation sequencing pipeline? After you have received your reads from the sequencer and you've done some quality control, you want to align to a genome, and we'll have a look at what that means in just a minute. When we get DNA from a sequencer, we get it in small pieces called sequencing reads that are part of much larger segments of genomic DNA. And when we get these, we don't know where in the genome they fit and how they relate to each other in terms of position. So we have two options at this point. If we have a reference genome, that is a complete genome or part of a genome from an organism, we can use mapping to associate the sequence data, by its sequence similarity, with a position on the reference genome. If we do not have that, or if we want to build a new genome, then we can use a process called assembly to assemble the short reads into a longer genome.
The starting point for mapping is a process called sequence alignment. We are using only the combination of letters in the short read sequence to determine where it fits on the reference genome. So in this example, look at the reference at the top there, A, A, C, G, C, C, T, T, and then we have a read that has some similarities and some differences. Let's have a look at that. The A is the same, but then there's a mismatch on the G, and the read is longer than the reference.
So maybe we should insert a gap in the reference genome here, relative to the read, and then there's a mismatch on the G again, a G and C mismatch, and then it matches all the way down.
When we get reads, they could align to multiple places. That's called multi-mapping, and how we handle it depends on the tool. Either you could map to the best region, but what is best, and what happens if there's a tie? Or you could try and map the same read to multiple regions, or you could map to one of the regions randomly, or you could discard the read.
But now, if we want to map to the best region, because there could be partial matches in multiple places, how do we determine the best region? We do that using an alignment score, which we assign for every mapping. The alignment score is calculated by giving a reward for every matching letter in the sequence, a penalty for every mismatch, and then putting in penalties for gaps. So if you're forced to insert a gap, either in the read or in the reference, during the process of alignment, you add a penalty. We can either use linear gaps, where every gap position gets the same penalty, but more often we use affine gaps, where opening the gap costs a lot and extending the gap is much cheaper than initially opening it, while still being negative. Different tools use different scoring values and give different results.
So if we look at this example here of the matches and the mismatches, we see here a match, and this is the running score at the bottom. Then we get a mismatch, so that's a penalty, and we have to insert a gap, and that's a penalty. Then we have to extend the gap, and that's another penalty, but a smaller one, and then two matches at the end give us some positive values, so that ultimately you have a final score of 19 for this read at this position.
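To make the running score concrete, here is a minimal sketch in Python of how an affine gap score could be computed for an alignment that has already been laid out with gaps. The function name, the example sequences, and the scoring values (match +1, mismatch -1, gap open -3, gap extend -1) are all assumptions chosen for illustration. They are not the values from the slide, so this does not reproduce the score of 19, and real mapping tools each use their own values.

```python
def affine_alignment_score(ref, read, match=1, mismatch=-1,
                           gap_open=-3, gap_extend=-1):
    """Score a gapped pairwise alignment; '-' marks a gap position.

    Convention assumed here: the first position of a gap costs gap_open,
    and each further position of the same gap costs gap_extend.
    Returns the final score and the running score after each column.
    """
    if len(ref) != len(read):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    running = []
    gap_in = None  # which sequence the current gap is in: "ref", "read" or None
    for r, q in zip(ref, read):
        if r == "-" and q == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if r == "-" or q == "-":
            side = "ref" if r == "-" else "read"
            # Continuing a gap in the same sequence is cheaper than opening one.
            score += gap_extend if gap_in == side else gap_open
            gap_in = side
        else:
            gap_in = None
            score += match if r == q else mismatch
        running.append(score)
    return score, running


# Hypothetical alignment: one mismatch and a two-base gap in the reference.
final, running = affine_alignment_score("ACG--TT", "ATGCCTT")
print(final, running)  # -1 [1, 0, 1, -2, -3, -2, -1]
```

Setting gap_open equal to gap_extend gives the linear gap model, where every gap position costs the same.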
We don't need to use only the simple letter information when we are calculating an alignment score. Firstly, with the sequencing that we get off modern sequencing machines, we have not just what the letter is, but what its quality is, so we can look at the base quality and maybe count mismatches at low-confidence bases with a lower penalty than mismatches at high-confidence bases. Then we know that in DNA, mutations are not random: it's easier for purines like A and G to change into each other than it is for a pyrimidine to change to a purine, and the same thing, it's easier for a pyrimidine to stay a pyrimidine. And then we might know something about the sequencing platform and its biases, so we'd optimize for its read length, its error rate, its homopolymer accuracy, and so on and so forth.
Talking about the alignment score made it look as if there was one optimal way to align sequences, but it's not as simple as that. If we take these two example sequences, then depending on how we want to organize gaps and mismatches, we can end up with different alignments. And for this particular example, these different alignments have all been given by different alignment programs that are out there. When we look at these alignments, they imply different variants between the two sequences. If you want to experiment with calculating alignments, here's a link, and you can click on it in the online version of the materials, for a kind of game where you can experiment with doing sequence alignment yourself in a browser.
In addition to the sequence read information, however, we do have something else that can help us sort out complicated alignments, and that is the fact that we typically get sequences as pairs of reads coming from larger fragments. If you know what the fragment length is, which is decided at the time of library preparation, then you know the approximate distance between the two reads that are from the same fragment, the two matching reads.
And then you can use this to deal with ambiguities in mapping. So let's take this example of some repeats. If we had two copies of the same DNA in the reference genome, and we got a match to one of them, we wouldn't know which copy to map to. But with paired-end information we can use the fact that the other read of the pair is some distance away to sometimes resolve a repetitive region. However, it's not always quite as simple as that, because sometimes if we see an increase in the so-called insert size between the two reads, it might be an indication that there has been a deletion or an insertion.
When you get data from the sequencer, it typically comes as pairs of FASTQ files, one file for the forward reads and another for the reverse, with a number like a one or a two indicating whether these are forward or reverse reads. Sometimes, however, you do get FASTQ files that contain both forward and reverse reads mixed together in an interleaved format, but because most tools require separate forward and reverse files, Galaxy has tools for de-interlacing FASTQ. In the files, the order of the reads matters, so that the nth read in the forward file matches the nth read in the reverse file. It's much faster working this way than trying to match things up based on read name. So that means that when you are doing trimming and filtering, you should always process the forward and reverse files together. Otherwise, if you remove a read from one of the files, you end up with files that are out of sync.
There are so many mapping tools out there. You can see a link at the bottom there to a comparison of some of these. So which one should you use? Each alignment tool is going to make different choices during the alignment, and this can affect your downstream results.
The best tool for your data depends on many factors, like the type of sequencing experiment that you've been doing, the sequencing platform, your compute resources, the sensitivity required, and read characteristics like paired or single end and the read length. As has been said, there's no tool that outperforms all of the others in all the tests. The end users should clearly specify their needs in order to choose the tool that provides the best results.
Once reads are aligned, you get outputs in a format called BAM. It looks like this, or rather this is SAM, which is the text version of BAM, and it has read IDs, read sequences, read positions and so on, but we will discuss that more during the tutorial. These are not really easy to read by eye, so we have genome browsers. This is IGV being demonstrated here, used for looking at what a read alignment looks like, or within Galaxy you can run something like JBrowse. Then there are external genome browsers that you can run on your desktop and link to data in Galaxy.
So what are the key points of our discussion thus far? Mapping is not a trivial problem. There are many mapping tools, and the best choice depends on your data. The choice of your mapper can affect your downstream results, so that's why you need to know your data and you need to know which mappers are typically used for that data type. And finally, genome browsers can be used to view aligned reads. Thank you very much.