Hello, I'm Peter van Heusden from the South African National Bioinformatics Institute, and welcome to this tutorial on Mycobacterium tuberculosis variant analysis. Mycobacterium tuberculosis is the pathogen that causes TB, which until recently was the world's top killer among infectious diseases. In 2020 it was pushed aside by COVID-19, but we still think that more than a million, maybe a million and a half, people worldwide died of tuberculosis last year. The M. tuberculosis genome is a 4.4 megabase genome; you can find out more about it in the tutorial, and the typical reference strain used is one called H37Rv. There is not a lot of variety within Mycobacterium tuberculosis genomes and there is no recombination, but we talk about that in the tutorial on Mycobacterium tuberculosis genomics, which you can watch alongside this one. So what are we trying to achieve with M. tuberculosis variant analysis? A couple of things interest us. Firstly, which lineage of tuberculosis are we looking at? This is significant for research purposes, and perhaps in the future we will understand what these lineages mean in terms of clinical outcome. Secondly, which variants in the genome are associated with drug resistance? Drug-resistant tuberculosis is unfortunately on the rise worldwide. And thirdly, the pattern of variants found in the genome can be used in outbreak investigation, where we try to find out which cases are connected to which others. So let's get analysing data. I'm going to copy the data links from the tutorial, and I'm going to use usegalaxy.eu, where I've already created an empty history called "M. tuberculosis sequence analysis". Then I click Upload, choose Paste/Fetch data, and paste all of these links, at which point Galaxy will start to load them for me.
Galaxy has finished processing our uploaded files. I first want to change some names: I'll remove everything here except the sample name and the .fastq.gz extension, so these become read 1 and read 2 of sample 004-2. The data in these two FASTQ files was collected from a southern African TB patient in a study examining reinfection versus reactivation of tuberculosis infection. The first thing I want to do with these files is some quality control. So I search for FastQC in my toolbox here, use the multiple dataset selection button to choose these two files, and click Execute. While FastQC is busy running, I'm going to get ready with MultiQC. What am I doing here? I don't want two individual reports; I want to aggregate the data into a single report. So in MultiQC I say FastQC is my input, choose the raw data output, and use Ctrl-click to select the raw data from those two files. Galaxy will put this in the queue, waiting for my other data: once FastQC has finished, MultiQC will start. Now that MultiQC has finished running, let's have a look. I click the eye icon here on the MultiQC webpage output. Give it a few seconds to load, and there it is. I'm going to collapse my side panels to give the MultiQC window more space; I can do that using the little arrows in the bottom corners of the Galaxy interface. So what we see here: the GC percentage is 66 percent, which is what we expect from Mycobacterium tuberculosis; quite a lot of duplicates, probably PCR duplicates; and 5.3 million sequences. And let's look at the quality: that is a pretty decent looking quality profile. We see that the forward and reverse reads both have quality mostly above 30, just dipping here at the end.
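Those per-base quality scores are Phred scores: a score Q corresponds to an error probability of 10^(-Q/10), so "mostly above 30" means under a 1-in-1000 chance of a wrong base call. A minimal sketch of how such scores are decoded from a FASTQ quality string (the helper names and the example string are my own, not from any specific tool):

```python
# Decode a Phred+33 encoded FASTQ quality string, as used in the
# FASTQ files above. The example string is invented for illustration.

def phred_scores(qual_string, offset=33):
    """Convert an ASCII-encoded quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in qual_string]

def error_probability(q):
    """Phred score Q -> probability that the base call is wrong."""
    return 10 ** (-q / 10)

quals = phred_scores("IIIIHHHGG@@:::")   # 'I' = Q40 ... ':' = Q25
print(sum(quals) / len(quals))           # 35.0, the mean quality
print(error_probability(30))             # 0.001: Q30 = 1-in-1000 error
```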
And then we look at the adapter content. It looks like there might be some sequencing adapters left in some of the sequences, but it's quite a low percentage overall. Think about how you would handle these sequences; while you're thinking about that, let me bring my Galaxy interface back, because I'm going to move on to some quality trimming. Now there are many different quality trimming tools, but I'm going to use Trimmomatic: it is one that I know fairly well and I like it. So here's Trimmomatic, and this is paired-end sequencing, and there we put the pairs in. Notice that I'm handling these sequence files as individual files. If you're dealing with a large amount of data, you'll probably want to use a collection of paired-end data for this analysis. The average quality in the sliding window is 20 by default; I'm going to bump that up to 30, because I saw the quality was quite good on these sequences, so I don't think I'll lose a lot by doing so. Then I'm going to add a second operation: if any read drops below 20 bases in length, it should be discarded. And here Trimmomatic has started running. You see it will give me both paired and unpaired reads. The unpaired reads appear when it decides to discard a read from one of the files but not the other; mostly, though, we'll be paying attention to the R1 and R2 paired outputs. And while Trimmomatic is running, I'm going to do another quality check, using a tool called Kraken 2. What Kraken 2 does is try to assign a taxonomic label to each sequence read. So I'm going to choose paired sequences, and I'm going to use the R1 paired and R2 paired outputs from Trimmomatic. This Kraken 2 job can run once Trimmomatic has finished. I want to use scientific names, and then down here in the create reports section, I want to print a report with aggregate counts. And of course I have to select a database: I want to use the standard database.
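The sliding-window trimming described above can be sketched in a few lines: cut the read at the first window whose average quality falls below the threshold, then drop anything shorter than the minimum length. This is a toy illustration of the idea, not Trimmomatic's actual algorithm or code:

```python
# Toy sketch of sliding-window quality trimming: keep bases up to the
# first window whose average quality drops below the threshold, then
# discard reads shorter than a minimum length. Not Trimmomatic's code.

def sliding_window_trim(quals, window=4, avg_quality=30):
    """Return how many leading bases to keep."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < avg_quality:
            return i
    return len(quals)

def keep_read(quals, min_len=20, window=4, avg_quality=30):
    """Apply the trim, then the minimum-length filter."""
    return sliding_window_trim(quals, window, avg_quality) >= min_len

quals = [38] * 30 + [10] * 10        # a good read with a poor-quality tail
print(sliding_window_trim(quals))    # 28: keeps most of the good bases
print(keep_read(quals))              # True: 28 bases is above MINLEN 20
```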
Let's use the most recent standard database, which is a database of essentially most of the sequences you find in NCBI, across different domains of life. In the report we're going to see, Kraken 2 tries to give us some sense of whether the species is predicted to be Mycobacterium tuberculosis. I said earlier that Mycobacterium tuberculosis doesn't have a huge amount of diversity within the species, which makes telling apart specific different mycobacteria quite difficult. I should rather have said that there isn't a lot of diversity within the genus: there's more similarity between mycobacteria than you might find in some other genera. But let's set this running; it will start after Trimmomatic. Kraken 2 is quite a resource-intensive process, so it will run for quite a while. If you want to work through the tutorial faster, I will make a saved history of this tutorial available for you to import, so you can skip this step and just see what the Kraken 2 output looks like. At this point Kraken 2 is busy running and Trimmomatic has run. There's not much to see in the Trimmomatic outputs, because these are just FASTQ files, but if we have a look at them, we see that the two paired-end files are 257 megabytes each; if we compare that to the input files, there is some change in size. Let me just go back to confirm that. And the unpaired outputs contain only a very small number of reads. So there has been some trimming down of our data to produce the Trimmomatic FASTQ files. While Kraken 2 is still running, I'm going to go back to my originally uploaded data and change some names: I'll remove the prefix so this becomes the Mycobacterium tuberculosis ancestral reference GenBank file, and do the same with the .fasta file. Now what is this file I'm looking at? As I mentioned previously, the standard reference genome for Mycobacterium tuberculosis is H37Rv. I'm going to rename this chromosome file in the same way.
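Kraken 2's report is plain tab-separated text: percentage of reads in the clade, clade read count, direct read count, rank code, NCBI taxid, and an indented scientific name. A small parser along those lines, for pulling numbers out of the report programmatically (the report lines below are invented for illustration):

```python
# Parse Kraken 2-style report lines. Columns: percent of reads in the
# clade, clade read count, direct read count, rank code, NCBI taxid,
# and an indented scientific name. These example lines are invented.

report_text = (
    " 95.00\t4750000\t1000\tD\t2\tBacteria\n"
    " 94.00\t4700000\t20000\tG\t1763\t  Mycobacterium\n"
    " 14.00\t700000\t5000\tG1\t77643\t    Mycobacterium tuberculosis complex\n"
)

def parse_report(text):
    rows = []
    for line in text.strip().splitlines():
        pct, clade, direct, rank, taxid, name = line.split("\t")
        rows.append({"percent": float(pct), "clade_reads": int(clade),
                     "rank": rank, "taxid": int(taxid), "name": name.strip()})
    return rows

rows = parse_report(report_text)
print(rows[1]["name"], rows[1]["percent"])  # Mycobacterium 94.0
```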
And H37Rv belongs to lineage 4 of tuberculosis. Iñaki Comas and colleagues used the diversity of Mycobacterium tuberculosis genomes to infer what the ancestral genome should look like, introducing only single-nucleotide substitutions. So the inferred ancestral reference genome is exactly the same size as H37Rv, which means the annotation still works the same. But in terms of sequence content, it is more neutral: the sequence is closer to what we think the ancestor of all lineages of tuberculosis looked like. That makes it quite useful for building phylogenies, especially because you are no longer getting a bias from comparing your samples against a lineage 4 reference. And now Kraken 2 has finished running, so let's have a look at its output files. This one is the classification of each individual read, many, many millions of lines, and this one is the report it generated. It starts by saying that 95% of all reads could be classified as something, so only 4% or so could not be classified. Then, reading down, we see that 94% are classified as Mycobacterium. The Mycobacterium tuberculosis complex itself only accounts for 14% of the reads, but that's because Kraken 2 is not able to unambiguously classify reads further down than the genus Mycobacterium. This suggests that we are looking at a Mycobacterium tuberculosis sample without significant contamination. So now that we've finished our quality control work, we can map these reads against the Mycobacterium tuberculosis genome, and for that we're going to use a tool called Snippy. Snippy is essentially a pipeline, written in Perl by Torsten Seemann, that does mapping, some cleaning up of the mapping, and then variant calling. It does the mapping with BWA-MEM and the variant calling with FreeBayes. So let's look at the options: we're going to use a genome from the history and build the index.
Now here I'm going to choose the GenBank file, because if I use a GenBank file, Snippy will generate a VCF that also has information about which genes the variants were found in. I'm going to use the R1 and R2 paired reads. Then, in the advanced parameters, I'm going to drop the minimum fraction down to 0.1. Normally it's 0.9, in other words variants are only called if 90% of the reads agree; I'm going to say 0.1 so I can see even rare variants. And finally, I want the VCF output, I want the tab-separated summary of all variants, and I want the alignment in BAM format. I'm not interested in the Snippy core output because I'm not analysing multiple samples at this point. Just checking against my tutorial: those three outputs are the only ones that I care about. Then I hit Execute, and the mapping and variant calling will now commence. Now that Snippy has finished running (that took about 20 minutes), we can see what our output looks like. OK, this is the alignment in BAM format. Then we have the VCF output, where we see various variants, and because we used a GenBank reference, the variants are annotated in terms of their location on the genome and which genes occur in that region. So let's have a look: 1,087 lines in this tabular output, which means we've got 1,086 variants of different types: SNPs, deletions, and so on. And this is what we should expect from a typical Mycobacterium tuberculosis sample: roughly in the range of some hundreds up to maybe 2,000 variants compared to the reference genome. So we've done various steps of quality control to get to this point. We've looked at the sequence quality with FastQC and MultiQC, we've checked for contamination with Kraken 2, and internally Snippy does its own quality control.
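The effect of dropping the minimum fraction from 0.9 to 0.1 can be shown with a toy calculation (the function and the read counts are mine, not Snippy's internals): a variant supported by a 20% minority of reads is invisible at the default threshold but reported at 0.1.

```python
# Toy illustration of a minimum allele fraction cutoff like Snippy's:
# report a variant only if the fraction of reads supporting the
# alternate allele reaches the threshold. The read counts are made up.

def call_variant(alt_reads, total_reads, min_fraction):
    """True if the alternate allele fraction meets the threshold."""
    return alt_reads / total_reads >= min_fraction

alt, total = 12, 60                    # 20% of reads carry the variant
print(call_variant(alt, total, 0.9))   # False: missed at the 0.9 default
print(call_variant(alt, total, 0.1))   # True: caught at 0.1
```

Lowering the threshold like this trades sensitivity to minority variants (for example, emerging resistance in a mixed population) against a higher risk of calling sequencing noise, which is why the later filtering steps matter.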
To make sure that we get good alignments specifically, you might have noticed in the advanced parameters section that there is a parameter for minimum mapping quality, which Snippy uses internally to ensure that we have good quality alignments coming out. And finally, we've used the inferred ancestral reference genome, which, as I described earlier, has some advantages for variant calling in Mycobacterium tuberculosis. One of the challenges with Mycobacterium tuberculosis is that the genome has some repetitive regions, in insertion sequence sites and also in the PE/PPE-PGRS gene families. So there is a tool specifically for filtering out these problematic regions, and for doing general VCF filtering for M. tuberculosis, called TB Variant Filter; I've found it now, and we're going to run it on the VCF file. And let's see which filters we should use. First, filter out variants by region: if we look at the options available here, the default regions that are filtered out are the PE/PPE regions, and the UVP pipeline has a useful list of repeats and insertion sequence sites, so those regions are also filtered out. Another filter we might want to use is removing variants close to indels. It's still being debated whether this is necessary, but in general, when an indel happens, the alignment a few bases around it (by default, five bases) can get a bit uncertain, so there can be false variants around the indel; that's why that filter is there. And then a very commonly used filter: filter sites by alignment depth. The default here is a read depth of 30, meaning we want at least 30 reads supporting each variant. I'm not going to change any of the default options for these specific filters; I'll just use them as they are. OK, TB Variant Filter is busy running, so now let's do something else in parallel. TB-Profiler is a tool to infer drug resistance from Mycobacterium tuberculosis sequence data.
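Two of the filters just described, minimum depth and distance to the nearest indel, are easy to sketch. The record layout and field names here are my own simplification for illustration; TB Variant Filter itself operates on real VCF records:

```python
# Sketch of two TB Variant Filter criteria: minimum alignment depth and
# proximity to an indel. The dict layout is an invented simplification
# of a VCF record, not the tool's actual data structures.

def passes_filters(variant, indel_positions, min_depth=30, indel_window=5):
    if variant["depth"] < min_depth:
        return False                      # too few supporting reads
    for indel_pos in indel_positions:
        if abs(variant["pos"] - indel_pos) <= indel_window:
            return False                  # too close to an indel
    return True

variants = [
    {"pos": 1000, "depth": 45},  # deep enough, far from any indel
    {"pos": 2003, "depth": 50},  # within 5 bp of the indel at 2000
    {"pos": 3000, "depth": 12},  # below the depth-30 cutoff
]
print([passes_filters(v, [2000]) for v in variants])  # [True, False, False]
```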
TB-Profiler can take input straight from the Illumina sequences, or you can feed it an aligned BAM, and Snippy has already produced an aligned BAM. It can output in text or PDF, and, just checking, the minimum allele frequency is 0.1, which is the same fraction I used in Snippy for its variant calling. And let's hit Execute. So this takes the BAM output from Snippy and runs in parallel with TB Variant Filter; it is not using the output from TB Variant Filter. You can run TB-Profiler on VCFs, but because it does its own variant calling, I prefer to run it on the BAM. When you are using an aligned BAM with TB-Profiler, it's not just useful but necessary that the alignment is against the H37Rv chromosome, so that TB-Profiler is able to interpret the variants in terms of the coordinate scheme of H37Rv. TB-Profiler has now finished running; this one took only a few minutes for me. And what does it give us? TB-Profiler gives us output in JSON format, which is a format designed for computer programs to read. It does its own VCF calling, but only focusing on regions that it considers of interest for drug resistance; you see the VCF is only 43 lines, quite a bit smaller than the VCFs from Snippy and TB Variant Filter. And then it produces a report, so let's have a look at that. This report can be in PDF; this one is in text format. Firstly, it tells us which lineage we're looking at: this is lineage 4, the so-called Euro-American lineage, of the TB sample we're looking at. Despite the name, the Euro-American lineage is common worldwide, at least in part due to the colonial spread of Europeans around the world. It gives us a predicted spoligotype, and then it reports predicted drug resistance. So here are some characteristic variants, such as rpoB Ser450Leu, and this is the fraction of reads in which that variant was observed.
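Because the JSON output is meant for programs, pulling the lineage and resistance calls out of it is straightforward. The key names in this mock report are illustrative only; they are not TB-Profiler's actual schema:

```python
import json

# Pull the lineage and resistance calls out of a TB-Profiler-style
# JSON report. This mock document and its key names are illustrative
# only, NOT TB-Profiler's real schema.

report = json.loads("""
{
  "lineage": "lineage4",
  "resistance_variants": [
    {"drug": "rifampicin", "gene": "rpoB", "change": "Ser450Leu"},
    {"drug": "isoniazid",  "gene": "katG", "change": "Ser315Thr"}
  ]
}
""")

drugs = sorted({v["drug"] for v in report["resistance_variants"]})
print(report["lineage"], drugs)  # lineage4 ['isoniazid', 'rifampicin']
```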
And we know that rifampicin resistance is largely associated with mutations in the rpoB gene. For isoniazid resistance, we have here a change in the katG gene, and so on and so forth. So unfortunately, this is showing that this sample is predicted to be resistant against all the first-line tuberculosis drugs, and it's classified as multi-drug resistant. And then there are variants in other positions which TB-Profiler did not associate with resistance. Now, before we finish off this analysis, I just want to look at a detail here in the VCF. When we use Snippy with a GenBank reference to create the VCF, all of the gene names have a "gene" prefix in front of them. Now the tool which I'm going to use next does not handle that prefix very well, so I'm going to use the sed tool for text replacement to remove those prefixes from the VCF. My sed expression simply takes that gene prefix and removes it across the entire line, and I'm going to run it on the output of TB Variant Filter. Execute. Now that I've got a filtered VCF and I have the output from running TB-Profiler, I can use the TB Variant Report tool, which uses the COMBAT-TB eXplorer database and the data we've generated thus far to produce an easy-to-read report on the variants we found in our sample. So my input here is the JSON: in the second option, the drug resistance reports, I'm going to use the JSON output from TB-Profiler, and I'm going to use the sed text transformation of the VCF as the variant input. And then I'm going to filter out some of the variants in intergenic regions, because the annotation sometimes shows a variant as intergenic when it actually is in a gene, and Snippy then gives us multiple reports for the same variant. Just having a look at the advanced options for which database to use: we will use the one hosted at SANBI. Okay, so that is waiting for the text transformation with sed to finish running.
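The sed step is a global search-and-replace on every line of the VCF. Assuming, per the video, that the prefix being stripped is "gene_", the same transformation in Python looks like this (the sample VCF line is invented for illustration):

```python
import re

# Strip a "gene_" prefix from gene names across a whole VCF line, as
# the sed text-replacement step in the video does. The VCF line below
# is invented; the "gene_" prefix is as described in the video.

line = "Chromosome\t761155\t.\tC\tT\t225\tPASS\tANN=T|missense_variant|gene_rpoB"
cleaned = re.sub(r"gene_", "", line)
print(cleaned.split("|")[-1])  # rpoB
```

The equivalent sed expression would be along the lines of `s/gene_//g`, applied to every line of the file.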
A later version of the TB Variant Report tool might not need this step. So now sed has finished running. If I look here, I still see the same variants, but now with Rv locus numbers rather than "gene_" prefixes, etc. This is sample 004-2, so I'm going to change the name here to "variants", so that this has a meaningful name in my history rather than what it was before. And now we can see that TB Variant Report has also finished running. This uses the COMBAT-TB eXplorer database to annotate the variants. That database is currently running on a server in Cape Town, South Africa, but it is an open source project, so you can actually run a local copy of the database if you want to, if you don't want to rely on our infrastructure. Again, I'm going to make the window a bit bigger to see the outputs. Here we can see the HTML, slightly prettier, version of the same drug resistance report that we got from TB-Profiler. But here is the interesting output, the TB variant report, where instead of looking at a VCF, each variant is now annotated in terms of its gene, which protein it is associated with, what the expected impact is (we've got here a synonymous variant, so the expected impact is low), the pathways, if known, that the protein is involved in, and the position in the genome. So if we Ctrl-click on this position, it'll open a window on eXplorer at SANBI (sorry, it's taking a little while to load; there it is), which is a JBrowse genome browser, and it is showing us that synonymous variant at that position, at that locus. This is the locus in question. If I right-click it and choose the option to search for this gene, it will open a window on top which loads the eXplorer results page for the gene, or here at the top there's a link I can just Ctrl-click, and then I get a full page for the gene. And this is what is known about this gene.
It's a DNA polymerase, here is its sequence, and so on and so forth. So then if I take this position and Ctrl-click it, it loads JBrowse and shows specifically where in the genome this variant is happening, so you can look at the genomic context of the variant. Finally, for this tutorial, we might want to see the alignment that underlies our variants. We can do that within Galaxy using JBrowse. The advantage of JBrowse is that we don't need to download the alignment to our local computers to visualise it; the alternative is using IGV and doing the download. So we're going to use the genome from the history, and then we want a new JBrowse instance, with the genetic code set to the bacterial code. I'm going to refer to the tutorial for these settings. So, a new JBrowse instance; then we insert a track group, which I'm going to call "sequence reads", and the first thing I want on an annotation track is the pileup from the BAM that came from Snippy. I'm going to leave the auto-generated SNP track on, say yes, and put this on for new users. The next annotation track I want is the variants, in VCF format. So that was the BAM; now we have the variants, and we're going to use the VCF of variants, the most recent one in our history, and also put this on for new users. Finally, I'm going to insert a new annotation track group, and in it an annotation track for the Mycobacterium tuberculosis GFF annotation. I want that to be a CanvasFeatures track type. For style.label I'm just going to use "product", and for style.description I'm also going to use "product", and then I'll say this should be on for new users too. Execute, and now it will go and build JBrowse for us. This is somewhat of a time-consuming process. Okay, our JBrowse has completed, so let's have a look. Again, I'm going to hide the side bars here.
The annotation track is not showing on our first load, so I'm just going to look for a variant of interest in the VCF variants track. It looks like there's something interesting around this position on the chromosome, so let's zoom in there, and here is the deletion that we heard about. I'm going to zoom out a little bit just to give us a bit more context. There is the BAM alignment, and there definitely appears to be some kind of a gap going on here, which FreeBayes has classified as a deletion. I'm going to zoom out a little bit further, and a little bit further still, so that I can see the read context: here we see the paired-end reads around this part of the genome. Zooming out quite a bit further, I'm now going to hide the actual read track so that I can just see the coverage plot, and we can see a dip in the coverage over there. Let's take another example: the katG mutation that we are interested in for drug resistance. So I'm going to have a look here in the report from TB Variant Report and search for katG. This is the problem of the annotation using Rv numbers: for katG we must just remember its Rv number, which is Rv1908c. And that shows us that the variant is at this position in the genome, so I'll go back to my JBrowse, look at that position and zoom out some, and there we can see very clear evidence that there is a variant at that position, the one associated with drug resistance. That brings me to the end of the tutorial on Mycobacterium tuberculosis variant calling and analysis with Galaxy. Look at the end of the tutorial for some examples of samples that are not as straightforward to analyse; it is always useful to know what kinds of problems you can encounter in sequencing and bioinformatics. I hope you enjoyed the tutorial.