Here are our accession numbers, and here is the genome we would like to map against. The first step is to download these SRA datasets from the Sequence Read Archive. To do that, we go to Get Data, select the Faster Download and Extract Reads in FASTQ tool, choose the list of SRA accessions, one per line, make sure Accessions is selected here, and click Execute.

Now that this is finished, you can see that this tool generates a number of datasets. A priori, you usually don't know whether the datasets you are downloading are going to be single-end or paired-end data, so the tool creates two collections: a single-end collection and a paired-end collection. In this case, only the paired-end collection contains any datasets, as you can see; both of these are therefore paired-end datasets.

The first thing we are going to do is QC this data using the fastp tool, so I'm going to find it here. We are dealing with a paired-end collection; this is the collection. The only other thing I am going to change is in the Output options: I am going to set Output JSON to Yes. Now fastp is finished. What it does is trim adapters, and it can also trim low-quality bases.

Now we can map these reads against the genome. I upload the genome into the history; it's right here. I am going to use BWA-MEM. My genome is now in the history, so I can choose this option and select the genome. My reads are in the form of a paired collection. Because I now have several collections, I need to pick the right one: this is the original collection, as the data was downloaded from SRA, and this is the one processed by fastp. That fastp output is the dataset I want to use now, dataset number 11 in this case. I am not going to change any other parameters; I'll just start the run.

Next, let's remove duplicates. The duplicate-removal tool is part of the Picard package, so we go here and select the MarkDuplicates tool.
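Because we set Output JSON to Yes, the fastp report can also be inspected programmatically. Here is a minimal sketch of pulling headline read counts out of such a report; the key layout (`summary` → `before_filtering`/`after_filtering`) follows fastp's JSON format, but treat it as an assumption and check it against your own file. The toy report below is made up for illustration.

```python
import json

def summarize_fastp(report_text):
    # Assumed fastp report layout: summary.before_filtering / after_filtering
    # each carry a total_reads count. Verify against a real report.
    report = json.loads(report_text)
    before = report["summary"]["before_filtering"]
    after = report["summary"]["after_filtering"]
    return {
        "reads_before": before["total_reads"],
        "reads_after": after["total_reads"],
        "reads_removed": before["total_reads"] - after["total_reads"],
    }

# Toy stand-in for a real fastp JSON report:
toy = json.dumps({
    "summary": {
        "before_filtering": {"total_reads": 1000},
        "after_filtering": {"total_reads": 950},
    }
})
print(summarize_fastp(toy))  # {'reads_before': 1000, 'reads_after': 950, 'reads_removed': 50}
```

MultiQC will read the same report later in this tutorial, so this is mainly useful if you want the numbers in a notebook.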
I am going to use a collection as input; this time it's the collection produced by BWA-MEM, which contains two BAM files. I am also going to switch this to Yes, because I don't want duplicate reads to be included in the output. At this point we can use this deduplicated, mapped data to actually go ahead and start calling variants.

We are going to call variants using a tool called LoFreq, which is specifically designed for calling low-frequency variants in mixtures, such as viral or bacterial samples. So I'm going to scroll and find the Variant Calling section. LoFreq is actually a collection of tools. The first one we are going to use is called Realign reads; it takes care of the indels, making sure they are inserted in a consistent way. Again, we use a collection as input, now the output of the MarkDuplicates tool. It does require a reference, and that reference is in our history, so we select this genome as the FASTA to use.

The next step is Insert indel qualities, which is required to be able to call indels. Again, we use a collection as input, this time the collection produced by Realign reads. We are going to use the Dindel approach, which again requires the reference genome, so we select our reference from the history and run it.

And now we can actually call variants with LoFreq proper. The input, again, is the collection produced by the Insert indel qualities step, and we select the reference genome. This tool requires a little more configuration. We want to call both single-nucleotide variants and indels, and we are going to tweak some of the variant-calling parameters: let's set the minimal coverage to, say, 50, the minimum base quality to 20, and the minimum mapping quality to 20. The mapping-quality threshold means that multi-mapped reads will not be used for calling variants. Everything else we can leave the same. So the variants are called.
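To see what the minimum-coverage setting is doing, here is a small illustration of filtering VCF records by read depth. LoFreq applies this threshold during calling, not afterwards; this hypothetical post-hoc filter just mimics the idea on VCF text lines, assuming a `DP` (depth) key in the INFO column, which LoFreq does emit. The records below are made up.

```python
def parse_info(info_field):
    """Turn an INFO string like 'DP=102;AF=0.97' into a dict."""
    out = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            out[key] = value
    return out

def passes_min_coverage(vcf_line, min_dp=50):
    # Keep header lines; keep data lines only if depth (DP) meets the cutoff.
    if vcf_line.startswith("#"):
        return True
    fields = vcf_line.split("\t")
    info = parse_info(fields[7])  # column 8 of a VCF line is INFO
    return int(info.get("DP", 0)) >= min_dp

records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_045512.2\t241\t.\tC\tT\t120\tPASS\tDP=102;AF=0.97",
    "NC_045512.2\t3037\t.\tC\tT\t30\tPASS\tDP=12;AF=0.40",
]
kept = [line for line in records if passes_min_coverage(line)]
print(len(kept))  # header plus the DP=102 record: 2
```

The low-coverage site is dropped because twelve reads are not enough evidence to call a low-frequency variant reliably.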
You can see that we now have two datasets in VCF format; VCF format looks like this. Now we need to annotate them, that is, determine which variants fall in coding regions, which do not, and so on. For this we are going to use SnpEff, but a special build of SnpEff designed specifically for SARS-CoV-2. It's found here in the virology section, so I'm going to choose that. Again, the input is the collection produced by LoFreq. This is the correct version of the genome; it's version two. I am going to select these two fields, nothing for upstream or downstream, nothing here, and just run it. Now that this is done, you can see that these datasets are, again, simply VCF files, but they now contain much more information about what these particular variants do.

To make downstream analysis easier, I will convert this into a tab-delimited format, and I will do this using the SnpSift Extract Fields tool on this collection. I enter the fields like this. You don't have to retype them from the screen: the same list of fields is given in the tutorial that accompanies this video. I will select one effect per line. We don't really need a multiple-field separator here, but I will use a dot for empty fields. And OK. You can see that at this point these are simply tab-delimited datasets, so they are a little easier to work with.

But we still have two datasets. What we would like to do is combine all of this into a single one, which we can then use for secondary analysis; for example, we can load it into a Jupyter notebook or do anything else we want with it. Since this is a collection, I need to collapse it. I go to the collection operations and use the Collapse Collection tool. I select the collection I want to collapse, specify one header line, and also choose to prepend the file name to every line of each dataset.
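The collapse step above can be sketched in a few lines: concatenate the tab-delimited datasets, keep a single header line, and prepend each data line with the name of the dataset it came from. The sample names and rows below are invented for illustration.

```python
def collapse(datasets, header_lines=1):
    """datasets: dict mapping dataset name -> list of text lines.

    Keeps the header from the first dataset only, and prefixes every
    remaining line with its dataset's name, tab-separated.
    """
    combined = []
    for index, (name, lines) in enumerate(datasets.items()):
        for line_no, line in enumerate(lines):
            if line_no < header_lines:
                if index == 0:
                    combined.append("Sample\t" + line)  # keep first header only
                continue
            combined.append(name + "\t" + line)
    return combined

datasets = {
    "sample_A": ["POS\tREF\tALT", "241\tC\tT", "3037\tC\tT"],
    "sample_B": ["POS\tREF\tALT", "14408\tC\tT"],
}
for row in collapse(datasets):
    print(row)
```

This is exactly the shape of table the Collapse Collection tool produces: one header, then every variant row tagged with its sample of origin.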
It will prepend every line in both datasets with the name of the corresponding dataset; for example, everything from here will be prepended with srr1195410. So let's see what we get. It is no longer a collection, just one dataset. If we click on the eye icon, you will see that there is a new first column, and that column indicates which dataset each row comes from. Of course, in this case we only have two datasets, but you can imagine this working exactly the same way if we had 1,000 or 10,000 datasets. So this is the table we can work with: the list of variants we are interested in.

At several points in this tutorial we generated summary statistics, logs, about the data that we have. The first such log is the JSON file generated by fastp, and the other one is the MarkDuplicates metrics output. We can use these two summary-statistics datasets to paint a picture of how good our data is, and for this purpose we are going to use the MultiQC tool. So first, let's tell it about fastp: this is our dataset number 13, the fastp report on our dataset collection, and we select that. The second input is the Picard tool, specifically MarkDuplicates; this is the other dataset, collection number 25. Let's visualize these two summary statistics.

If you click on the eye icon, you will see the graphical output of MultiQC, and there is a lot of information here: how many reads you have, what the percentage of duplicates is, how many reads contain adapter sequences. You can actually see that in one of the datasets there are quite a few. Perhaps the most important metric here is the distribution of base qualities, and you can see that these datasets are actually quite good.
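As one tiny example of the secondary analysis the collapsed table enables, you can count how many variants each sample contributes by grouping on the prepended first column. The rows below are made up; a real table would have many more columns from SnpSift.

```python
from collections import Counter

# Collapsed table: first column is the sample (dataset) name.
table = [
    "Sample\tPOS\tREF\tALT",
    "sample_A\t241\tC\tT",
    "sample_A\t3037\tC\tT",
    "sample_B\t241\tC\tT",
]

# Skip the header, take the first tab-separated field of each row.
counts = Counter(row.split("\t")[0] for row in table[1:])
print(dict(counts))  # {'sample_A': 2, 'sample_B': 1}
```

The same one-liner scales unchanged to the 1,000-dataset case mentioned above, which is the whole point of collapsing the collection first.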