I am Peter van Heusden from the South African National Bioinformatics Institute, and welcome to a short talk on short variant discovery in Mycobacterium tuberculosis. According to 2018 figures, there were about 10 million cases of tuberculosis worldwide and 1.4 million deaths. Before the coronavirus pandemic, this was the world's deadliest infectious disease.
The reference strain, H37Rv, is a laboratory strain which was isolated in the 1930s from a sample that was collected in the early part of the 20th century. About a quarter of the proteins annotated in the H37Rv reference genome are listed as hypothetical. There is other annotation of what the function of these genes is, but it's not incorporated into this reference. It's a slow-growing bacterium; it only doubles about once a day. And there's no horizontal gene transfer in tuberculosis. In fact, it seems to use the mechanism that otherwise would be used for horizontal gene transfer as part of its arsenal of weaponry against the human immune system. It sits in the genus Mycobacterium, in the group called the Mycobacterium tuberculosis complex; we'll hear a little bit more about that now.
The mycobacteria are a diverse genus of pathogenic and non-pathogenic bacteria. The nontuberculous mycobacteria are all of those except the ones that cause TB and leprosy. The leprosy bacterium is even slower growing, but it is a reasonably close relative of Mycobacterium tuberculosis. The nontuberculous mycobacteria are found in soil and water, and they do occasionally cause human disease, but they don't tend to spread from one human to another. Mycobacterium tuberculosis itself is a complex of, well, in this diagram it was seven lineages. There are now eight recognized, and a ninth has been proposed. And this complex also includes some of the animal tuberculosis bacteria like Mycobacterium bovis, Mycobacterium pinnipedii, and so on and so forth. But the modern mycobacteria are these ones: lineages two, three, four, one, and seven.
The differences here are largely due to so-called regions of deletion.
The genome of Mycobacterium tuberculosis is a single 4.4 megabase circular chromosome. In the reference genome, NC_000962.3, there are 4,018 coding sequences, 56 insertion sequence sites (that will be important in a while), and a so-called direct repeat region of 36 base pair repeats with spacers between them that are not repetitive. And it contains the PE/PPE/PGRS family of repetitive proteins. So here is the Mycobacterium tuberculosis genome. Here we can see some of the insertion sequence sites marked all around the genome, and here is the direct repeat region, and the PPE and PGRS genes are also noted in this diagram.
So how can we genotype Mycobacterium tuberculosis? One way to do it is to use one of the insertion sequences that is characteristic of Mycobacterium tuberculosis, IS6110, and use a technique called restriction fragment length polymorphism to essentially count how many of these insertion sequences you're seeing in a sample. Then there is a variable number of tandem repeats method called MIRU, and often these two methods are used in tandem to try and genotype a sample. And then there is something called spoligotyping, or spacer oligotyping. Remember I said that there's this direct repeat region. The spacers between the direct repeats have a characteristic genomic pattern, and spoligotyping tries to identify a sample of Mycobacterium tuberculosis using those spacers. There's a nice review article that I linked to there.
Now, strains that are decades or even hundreds of years apart in transmission can actually share a genotype. So this genotyping is not terribly high resolution. So that brings us to whole genome sequencing.
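To make the spoligotyping idea above a bit more concrete, here is a minimal sketch of how a spacer presence/absence pattern is commonly summarised as an octal code. It assumes the standard 43-spacer panel, which the talk does not spell out, and the function name and example pattern are purely illustrative, not taken from any particular tool.

```python
def spoligotype_to_octal(spacers_present: str) -> str:
    """Convert a 43-character presence/absence string to a 15-digit octal code.

    Assumption: '1' means the spacer was detected and '0' means it was not,
    in the conventional spacer order 1..43.
    """
    if len(spacers_present) != 43 or set(spacers_present) - {"0", "1"}:
        raise ValueError("expected exactly 43 characters, each '0' or '1'")
    # The first 42 spacers are read as 14 groups of three bits (one octal digit each).
    digits = [str(int(spacers_present[i:i + 3], 2)) for i in range(0, 42, 3)]
    # The 43rd spacer is appended on its own as a final 0 or 1.
    digits.append(spacers_present[42])
    return "".join(digits)


# Illustrative pattern only: spacers 1-34 absent, spacers 35-43 present.
example_pattern = "0" * 34 + "1" * 9
print(spoligotype_to_octal(example_pattern))
```

Two isolates with the same code are indistinguishable by this method, which is one way to see why its resolution is limited compared with whole genome sequencing.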
One advantage of whole genome sequencing is that we can infer transmission and, to some degree, allele flow. There's no clear SNP threshold to say whether two sequences that are a certain number of SNPs apart are one transmission event. There's a lot of debate within the literature as to the relationship between SNP distance and transmission. But we can generally rule out whether two sequences are from the same cluster much more easily than we can rule sequences in. Also, obviously, whole genome sequencing is necessary if we want to use GWAS to explore genotype-phenotype links. And, very importantly, we can use whole genome sequencing to perform in-silico drug resistance testing in a way that is sometimes more accurate and sometimes less accurate than phenotypic drug resistance testing, or drug susceptibility testing (DST) as it's sometimes known.
Why is Mycobacterium tuberculosis somewhat easier to analyze than some other bacteria? As I mentioned previously, there is no horizontal gene transfer or recombination, both of which are common in many other pathogenic bacteria and complicate bacterial phylogenies. You have to identify and mask recombination hotspots when you are computing phylogenies of bacteria that do have these features. And when you have bacteria which are swapping around genes and plasmids and things like that, then you really have to analyze not just the phylogeny but also the flow of genes in horizontal transfer. Something like Klebsiella pneumoniae is very susceptible to swapping antibiotic resistance plasmids this way. And because of the diversity in many bacterial species, it's difficult to use a single reference sequence for the whole species, because you can't map all the reads to a single reference when the different parts of the species are so different from each other.
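As a rough illustration of the SNP distance reasoning mentioned above, the sketch below counts the positions at which two samples differ, given their SNP calls against the same reference. The data layout, the function names, and especially the threshold are hypothetical; the talk is explicit that there is no agreed cut-off, so the threshold is a required parameter and the result is only ever "ruled out" or "not ruled out". The sketch also assumes every position was well covered in both samples, which real data rarely guarantees.

```python
def snp_distance(sample_a: dict[int, str], sample_b: dict[int, str]) -> int:
    """Count reference positions where two samples carry different alleles.

    Each sample maps a 1-based reference position to its alternate allele.
    A position missing from a sample is assumed to match the reference.
    """
    positions = set(sample_a) | set(sample_b)
    return sum(1 for pos in positions if sample_a.get(pos) != sample_b.get(pos))


def ruled_out_of_cluster(distance: int, rule_out_threshold: int) -> bool:
    """Return True when the distance is too large for the pair to share a cluster.

    A False result does not prove transmission; it only fails to rule it out.
    """
    return distance > rule_out_threshold


# Made-up calls for two samples, purely for illustration.
sample_a = {1000: "T", 2500: "G", 40000: "A"}
sample_b = {1000: "T", 2500: "C", 77777: "G"}
distance = snp_distance(sample_a, sample_b)
print(distance, ruled_out_of_cluster(distance, rule_out_threshold=12))
```

Comparing alleles per position, instead of taking a set difference of (position, allele) pairs, avoids counting a position twice when both samples differ from the reference in different ways.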
And antimicrobial resistance in some other pathogenic bacteria is typically on the level of genes, not point mutations like we see in Mycobacterium tuberculosis.
So there are some challenges in variant discovery in Mycobacterium tuberculosis. The first one is that your typical Illumina read is less than 250 base pairs, which limits your ability to discover insertions and deletions, especially if they start getting bigger. And structural rearrangements are definitely more difficult to find because the reads are quite short. Also, the reads are shorter than the length of the repetitive structures, for instance insertion sequences or the PPE and PGRS genes. So that means that those repetitive regions are more difficult to characterize with short-read sequencing.
And the H37Rv genome is not a neutral target. It's a lineage 4 genome. The lab isolate was separated out in the 1930s. The original patient was in 1905. And there are some significant differences in the ESX system, which in H37Rv is different to many clinical strains. It also has an RvD5 deletion relative to many other clinical strains. So what we might actually be seeing in patients is, in some ways, not H37Rv. It's quite different. And, as I mentioned, it is lineage 4, which means it's not neutral when it comes to its position within the diversity of Mycobacterium tuberculosis.
This diagram from Iñaki Comas's lab, from one of their preprints, is quite useful. If you align different lineages to H37Rv, then the number of SNPs that you get is lineage dependent. You'll see a certain number of SNPs from lineage 4. Specifically for 4.10, which is where H37Rv fits, there will be fewer SNPs. But if you're taking something like lineage 1, it's further away in the phylogenetic tree from H37Rv, so you see more SNPs. However, Iñaki Comas and co-authors developed an inferred ancestral genome, which is largely neutral with regard to the different lineages.
It's the same length as H37Rv, but it has single nucleotide polymorphisms inserted to try and approximate what we think the common ancestor of these lineages looked like. That genome is available on Zenodo at the moment.
Getting into further challenges: contamination of MTB samples is common, especially in direct-from-sputum sampling. Taxonomic filtering is recommended prior to variant analysis. That means trying to see whether the reads that you're looking at are actually Mycobacterium tuberculosis reads. However, the commonly used tools for this, Kraken and Kraken 2, are very memory intensive. Centrifuge is another similar tool that is less memory intensive, but the question of whether it is sensitive enough is still being investigated. Of course, most variant calling software, and a lot of bioinformatics, is tuned for human data. Its accuracy might not be as good as it could be if more effort was put into bacterial variant calling. And different groups differ as to which regions they think should be masked out in terms of where the repetitive regions are. The correct mask for the Mycobacterium tuberculosis genome is an open question. There's a nice overview in the link that I provided over there.
Just to illustrate what's difficult about the PE/PPE/PGRS genes: this is a graph where each edge shows that there's greater than 70% identity, using alignment with BLAST. You see that there's this cluster of very, very similar genes. That means that a read from one of these genomic regions might map to any of the other genomic regions belonging to one of these genes, which will throw your mapping off quite substantially. The same happens around insertion sequences. This is around the IS6110 insertion sequence. I took reads from exactly the same genome, in other words the H37Rv reference, and mapped them back to the genome, and this colorful display here shows how the positional mapping goes wrong around this IS6110 region.
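The masking step discussed above can be sketched roughly as follows: given a list of regions a group has decided to exclude (for example repetitive genes or insertion sequences), drop any predicted variant whose position falls inside one of them. The coordinates below are invented for illustration and are not real PE/PPE or IS6110 positions, the function names are hypothetical, and this is not meant to reproduce how any existing tool implements its filters.

```python
from bisect import bisect_right


def merge_regions(regions):
    """Merge overlapping or adjacent 1-based inclusive (start, end) regions."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous region, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def filter_masked(variant_positions, regions):
    """Return the variant positions that lie outside every masked region."""
    merged = merge_regions(regions)
    starts = [start for start, _ in merged]
    kept = []
    for pos in variant_positions:
        # Find the last merged region that starts at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= merged[i][1]:
            continue  # inside a masked region, so discard this variant
        kept.append(pos)
    return kept


# Made-up mask and variant positions.
mask = [(100, 200), (150, 300), (1000, 1100)]
variants = [50, 120, 300, 301, 999, 1000, 5000]
print(filter_masked(variants, mask))
```

Because the mask is just data, swapping in a different group's list of regions changes the result without touching the code, which mirrors the open question of which mask is the right one.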
So repetitive DNA is a problem. And we're hoping that long-read sequencing will solve that, with technologies from Pacific Biosciences and Oxford Nanopore. But these reads are still quite noisy. The error rate for long reads is high, but much lower with some of the newer PacBio sequencers. And with Nanopore, the error rate has dropped from about 20% to under 5% in five years. So we're hoping that in a few years' time we'll be able to use long reads. Unfortunately, there's one characteristic error that we find, especially with Nanopore, which is homopolymer errors, that is, for instance, changing from GG to GGG. They are still cropping up, even in the newer Nanopore technology that I've seen.
And the methods of DNA extraction are well studied when it comes to short-read sequencing, but are still more challenging when it comes to long-read sequencing, especially with Nanopore. You really want long stretches of DNA, but, going back to Mycobacterium tuberculosis, the cell wall is quite tough, and it's quite hard to break that cell apart and get good, long DNA out of it. If you have both long- and short-read technologies, they allow for rapid de novo genome assembly, and thereby for investigating clusters with very high resolution, because you can actually build an outbreak cluster.
Now some bioinformatics tools, which are also described in the tutorial we have for Galaxy. TB Variant Filter allows you to apply common filtering operations to predicted variants. So once you've predicted some variants, you don't need to go and look up where the PE/PPE regions are. TB Variant Filter will apply those filters for you. Then, from SANBI, we've created a tool called tbvcfreport, which annotates each variant with links back into the COMBAT-TB NeoDB database, so that you can learn more about the genes that have been annotated in the variants that you've identified.
And finally, Galaxy also has a wrapper for TB-Profiler, which is a drug resistance and lineage prediction tool from Jody Phelan at the London School of Hygiene and Tropical Medicine. There are other tools for drug resistance prediction, but this is one that's easy to use and is available on Galaxy servers such as usegalaxy.eu.
I have to acknowledge some people who gave me amazing insights for writing these slides. Iñaki Comas, whom I mentioned previously, worked on the inferred ancestral reference genome, and his group has worked on contamination in samples. Conor Meehan, who's an all-round expert on Mycobacterium tuberculosis bioinformatics and is an amazing tweeter, was just really helpful. Caroline Colijn worked on transmission modeling in Mycobacterium tuberculosis and TB. Jody Phelan, for TB-Profiler. And then the COMBAT-TB group at SANBI, that is Thoba Lose and Ziphozakhe Mashologu. And Torsten Seemann, who is a powerhouse of microbial bioinformatics, helped comment on these slides and wrote snippy and shovill and so many other tools. And the South African National Research Foundation and the Medical Research Council, which fund our work at SANBI. I'm sure I've missed off some people. Thank you very much.