Hello, I am Peter van Heusden from the South African National Bioinformatics Institute. Welcome to the tutorial about sequence mapping. As described in the theoretical introduction, sequence mapping is about finding the position of a sequence read on the reference genome using the sequence information and sometimes information about how the read was sequenced.
For this tutorial, I will be working on the usegalaxy.eu server. I've used this server before, so I've got something in my history, but I want to start fresh, so I'm going to create a new history for myself and call it "sequence mapping". This tutorial uses some data provided on the Zenodo servers, so I've collected the links to the data from the tutorial, which you can see here. You can just go to the tutorial and copy the links, and I'm going to load them into Galaxy. I'm going to choose Paste/Fetch data, paste the links, and click Start. Galaxy will now fetch this data from the Zenodo servers. This data is quite small because we're just using it for the tutorial and we're not doing a real analysis. After a while, Galaxy will start analyzing the data that we've uploaded to collect some metadata about it, and that's why it has now gone from gray to yellow like this.
Now that my data is ready in Galaxy, the first thing that I want to do is give my datasets more meaningful names. We see here that they have long names. I'm just going to call this one reads2, save it, and I'm going to call this one reads1. You can give your reads a hashtag by opening the detail view and just typing something that starts with a hash.
When we get data from the sequencer, we typically first perform quality control and trimming before doing further analysis. Galaxy has its own tutorial on quality control, so I will skip that step here because I happen to know that this is already good-quality, trimmed data.
Sequence mapping involves mapping reads to a reference genome. A reference genome is a high-quality, curated genome for the organism that you're analyzing.
These reads came from a mouse, although you cannot tell that just by looking at the data itself. The genome that we will use as a reference is the mm10 reference genome. There have been other mouse reference genomes in the past, but over time, as genome assembly improves, new versions of reference genomes become available. For instance, in human, we have hg19 and hg38. The decision as to which one to use is largely determined by which tools you're using. Some databases are built for a particular reference genome, and then you have to use that reference genome in your analysis. When we're dealing with bacteria, there are often many reference genomes because of the diversity within bacterial species.
There are many mappers available, but for this tutorial we'll be using a well-known one called Bowtie2. So let's find Bowtie2 here in the tools menu, and there it is. First, it asks us if our reads are single-end or paired-end, and these are paired-end reads, so we set them up. Then it asks us some questions about paired-end options, so let's have a look at these. First, the minimum and the maximum fragment length. Remember that paired-end reads come from either end of a fragment of DNA, so we could guide the mapper by specifying how large we expect that fragment of DNA to be. We leave these at the defaults. Then the upstream and downstream mate orientation. Do we know if read 1 is forward and read 2 is reverse, and so on? Well, this is something that we would know from library construction time. We don't know it for these reads, so I'm going to leave this option off, but if we do have this information, it can make mapping a little bit more accurate. So I'm leaving these as defaults.
Then the question of the reference genome. At the moment we're working with data from mouse, which is a model organism with a well-known reference genome, and that is built into this Galaxy server, so I can just search for mm10. And there it is.
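For readers who also run the mapper outside Galaxy, the paired-end options discussed above correspond roughly to the following Bowtie2 command-line flags. This is only a minimal sketch: the Galaxy tool builds its own command behind the scenes, and the index prefix and file names used here (`mm10`, `reads_1.fastq`, `reads_2.fastq`, `aligned.sam`) are hypothetical placeholders. The values 0 and 500 and the `--fr` orientation are Bowtie2's documented defaults.

```python
import shlex

def bowtie2_paired_command(index_prefix, reads1, reads2, output_sam,
                           min_fragment=0, max_fragment=500, orientation="fr"):
    """Build (but do not run) a Bowtie2 command line for paired-end reads."""
    # Bowtie2 accepts --fr, --rf or --ff for the relative mate orientation.
    if orientation not in {"fr", "rf", "ff"}:
        raise ValueError("orientation must be 'fr', 'rf' or 'ff'")
    if min_fragment > max_fragment:
        raise ValueError("min_fragment cannot exceed max_fragment")
    return [
        "bowtie2",
        "-x", index_prefix,        # prefix of the reference genome index
        "-1", reads1,              # file with the first mates
        "-2", reads2,              # file with the second mates
        "-I", str(min_fragment),   # minimum fragment length
        "-X", str(max_fragment),   # maximum fragment length
        f"--{orientation}",        # upstream/downstream mate orientation
        "-S", output_sam,          # SAM output file
    ]

# Hypothetical file names, for illustration only.
cmd = bowtie2_paired_command("mm10", "reads_1.fastq", "reads_2.fastq", "aligned.sam")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each option next to its value and avoids quoting problems if the command is later passed to `subprocess.run`.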
Sometimes we are working with a less well-known organism, and we might bring our own reference genome, which we would upload into the history and select from the history. The next question is about read group information, which is how to tag the alignment in terms of which dataset it came from. This is very useful if you are dealing with multiple datasets, but we're not using it today. And then finally I want to save the Bowtie2 mapping statistics to the history, so I can know a little bit about how the mapping has proceeded. Having done all of this, I now hit Execute.
Short read mapping is a somewhat time-consuming process. In this tutorial, however, we are dealing with small datasets, and mapping should proceed relatively fast.
Our mapping has completed. Let us have a look at some statistics about how the mapping was done. As we know, our reads are all paired. In this case we only have 50,000 reads, a really small number these days. Of those, 1,880 did not map, and 3,389 mapped multiple times. These two numbers, the reads that did not map and the reads that mapped multiple times, tell us something about our data. The unmapped reads could be a result of sequencing errors, or a result of divergence between the reference genome and our reads, because there is, of course, variation within the species that is not captured in the reference genome. The multi-mapped reads could be caused by something like repetitive DNA in the genome.
Now that we've looked at some statistics about the mapping, let's look at the mapped reads themselves. This is BAM or SAM format. Behind the scenes it could be compressed BAM, but Galaxy will show it to us in a human-readable format like this. We see here that we have a number of columns: QNAME, FLAG, RNAME, etc. So this is the query (read) name, the flags, the reference name, the position, the mapping quality, and the CIGAR string, which I'll get to in a bit.
Then the paired read's reference name, the paired read's position, and then the insert size, the sequence, and further on the quality of the sequence and some optional fields at the end there. Here at the top, all these lines that start with an @ sign are header lines. We see that there are different types of header lines, and these ones are just describing the chromosomes in the mouse genome reference that we've been using.
Here we get the actual mapped reads. So here's the read name, the flag, then the chromosome, the position, and the mapping quality. Now, mapping quality is an aligner-specific figure. There is no single convention across aligners for what these numbers mean. Then we get the CIGAR string. The CIGAR string tells us something about matches and mismatches. So we see here 51 matches. Over here we see 21 matches, one insertion, and then another 21 matches. Then comes the reference sequence that the paired read mapped to, which here is the same one, and the position where the paired read mapped. Then we get the insert size, which is basically related to the fragments that we were sequencing and therefore the gap between the two reads of a pair. Then the read sequence itself, its quality, and some further information.
So BAM format gives us the raw information on the reads and how they align, but if we want to visualize the alignment we need to use a special-purpose alignment viewer. I have the IGV alignment viewer installed on my computer, and here it is. We can see that at the moment it has the mouse mm10 genome loaded, but it has no information about the alignments. Galaxy can link our alignment file to the local viewer. So if I click on the title of the alignment file here and look at the list of visualizations, one of them is "display with IGV local".
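To make the CIGAR strings described above more concrete, here is a small sketch of how one could be split into its operations. It assumes the operation letters defined in the SAM specification (M, I, D, N, S, H, P, = and X); the function names are made up for this example. Note that in the specification M means an aligned position, which can be either a match or a mismatch.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# A whole CIGAR string: one or more operations and nothing else.
CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":  # "*" means no alignment information is available
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]

def reference_span(ops):
    """Number of reference bases covered by the alignment."""
    return sum(length for length, op in ops if op in "MDN=X")

def read_length(ops):
    """Number of read bases described by the CIGAR string."""
    return sum(length for length, op in ops if op in "MIS=X")

# The two CIGAR strings mentioned in the walkthrough.
for cigar in ("51M", "21M1I21M"):
    ops = parse_cigar(cigar)
    print(cigar, ops, reference_span(ops), read_length(ops))
```

For `21M1I21M` the read is 43 bases long but covers only 42 bases of the reference, because the inserted base exists in the read and not in the reference.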
When I click that, just going back to my web browser, notice that Galaxy does some preparation of the data and then feeds it through to my IGV. IGV is now running, but we can't see the alignments because there are only about 50,000 reads here and they are scattered across the genome. Let's go back to our BAM and just take one example. So let's look at chromosome one and this position. I type chromosome one, a colon, and put a position in, and we wait a while because the data is now being loaded from the Galaxy server to my local computer. And here's the read aligned. Notice over there this little I. Remember that the CIGAR string said there was an insertion in the sequence, and there is the insertion of AA.
So now let's go back to the tutorial, which suggests looking at this region of chromosome two. While IGV is busy loading, notice the format of positions in IGV: the chromosome name, a colon, the start position, a hyphen, and the end position. These commas are put in by IGV; we don't need to type them.
So here we have a region of the chromosome with a large number of reads mapped. The colors relate to how well they're mapped and the insert size that is detected, in other words the size of the particular fragment that these reads came from. This graph at the top here is a graph of depth of coverage, so we can see here that peak coverage is about 200 in this position, dropping down to nothing here. It's very difficult to say anything about this coverage because we have such a small sample, but we often do look at a graph like this to see whether we actually have enough reads in a region to realistically detect variants.
Talking about variants, if we look over here, we can see that these reads all seem to have a variant compared to the reference genome. Looking closer, and we can zoom in a bit more, we see here that there is a T in the reference genome at this position, but here we have a C. That suggests that the sample that we had has a genuine difference compared to the reference in this position.
If we want to change the color coding of the reads, we can right-click. There are many options in the IGV right-click menu, including how to color the reads. For instance, you can color by which strand they match to. Let me zoom out a little bit to see that more easily. Then we can go back to the default coloring. Google "IGV read colors" to get a guide from the authors as to what all these different colors mean.
Now, going back to Galaxy. I mentioned before that every time I'm working with data in my local IGV, it's downloading that data to my local computer. That can be time-consuming, so we can instead use an in-Galaxy genome browser, JBrowse. So let's find JBrowse. What we want to run is JBrowse, the genome browser, not any of these other tools. Again we're going to use a built-in genome, and we're going to use the mm10 genome. If you were
dealing with a custom genome, you'd mention that here. We're going to create a new JBrowse instance; we're not updating an existing one. So let's insert a track group. We can give it a name, but I'll stick with the default. I'm going to insert an annotation track of type BAM Pileups. It already selects my only BAM output, which is the Bowtie2 alignment. Here I want to switch on Autogenerate SNP Track, because I don't have a deep-coverage BAM file, and I want to set the track visibility to on for new users. Now I click Run. Because JBrowse is building a whole website for us, it takes a while to run.
My JBrowse has finally finished running. It took quite a while. It is only meant to take a few minutes, but for some reason mine took more than half an hour. So if your JBrowse is taking a long time to process, it might be because of server issues where JBrowse is running. Just be patient, find something else to do, and come back to it later. Otherwise, I'm going to make this history available via a link that you can click on to get your own copy of the history, so that you don't need to wait for the process to finish.
Now I'm going to view the JBrowse results, and while I'm doing that I'm going to move the Galaxy sidebars to the side using these little arrows in the corners, because I want to use my whole screen to see what JBrowse is showing me. I asked it to enable these tracks by default, but it didn't do so, so now I'm enabling the Bowtie2 tracks, including the SNPs and coverage track. You see it says loading. This is not loading just in my local browser; it's also loading data from the server that Galaxy is running on into the JBrowse browser. I look down here and I can see that this gray bar is the depth of coverage across the genome. So if I take a well-represented segment like this and go there, I can then zoom in on perhaps just this region. When I do that, I see that the reads are colored in blue and red. You saw that that
was one of the coloring schemes allowed in IGV, in terms of whether it is a forward or reverse read.
Now I'm going to go back to the tutorial and take these coordinates. It says here to zoom in at these coordinates, but this form, with a hyphen, is how IGV displays coordinates. JBrowse uses two dots between the start and end coordinates, not a hyphen, so I have to change that, like this. So we're seeing chromosome two from this coordinate to that coordinate. I hit Enter, and once again JBrowse starts loading the data. Now this view should be familiar, because this is the same data that we were looking at with IGV. We can see there the peak of coverage, we can see the dips in coverage, and we can see some places where JBrowse thinks there might be a SNP. Remember that this is a first approximation; a genome browser is not a SNP caller. So this is just demonstrating that you can use JBrowse as an alternative to view your alignment within Galaxy without having to download the entire alignment to your local computer.
And that brings us to the end of our tutorial. I hope that you now appreciate how straightforward running mapping is within Galaxy, but also how your choice of tools and approaches for short read mapping depends on the data that you're working with. Your starting point for all of these processes is knowing your data: knowing your reference genome, and which is the best one to work with for your particular use case, and knowing the sequencing platform that you've used for generating the sequence reads. I look forward to interacting with people during the interactive sessions.
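As a footnote to the coordinate formats mentioned above, the change from the IGV style (hyphen, optional commas) to the JBrowse style (two dots) is mechanical, so it can be sketched in a few lines. The function name and the example coordinates below are made up for illustration; they are not the region used in the tutorial.

```python
import re

# IGV-style locus: chromosome, colon, start, hyphen, end (commas allowed in numbers).
IGV_LOCUS = re.compile(r"^(?P<chrom>[^:]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")

def igv_to_jbrowse(locus):
    """Convert 'chr:start-end' (IGV style) to 'chr:start..end' (JBrowse style)."""
    match = IGV_LOCUS.match(locus.strip())
    if match is None:
        raise ValueError(f"not an IGV-style locus: {locus!r}")
    # Drop the thousands separators that IGV adds for readability.
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("start position is after end position")
    return f"{match['chrom']}:{start}..{end}"

# Hypothetical coordinates, for illustration only.
print(igv_to_jbrowse("chr2:1,000-2,000"))
```

With the made-up locus above, this prints `chr2:1000..2000`.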