Okay. Hi everybody. I'm Gary, and I'll be providing the lecture for module nine, which is on environmental genomics and lateral gene transfer. But first, a little bit about myself. I am the chief of the bioinformatics section at the National Microbiology Laboratory, which is part of the Public Health Agency of Canada. I was hired in 2005 to establish the bioinformatics lab at the National Microbiology Laboratory. My arrival coincided with the acquisition of our first high-throughput sequencer. That was a 454 sequencer, and actually it was the first installation of a high-throughput sequencer in Canada. I knew that this was probably going to be pretty important and would impact a lot of the work that I was expected to do. That has turned out to be true. Back then, genomic epidemiology of infectious diseases didn't even really exist. So basically it's been about a 17-year process of building up genomic epidemiology within the National Microbiology Lab and more broadly. And I think my career objective is to deploy genomic epidemiology as broadly as possible across Canada, and also globally, for public health applications. Training is a very big component of deploying genomic epidemiology capacity, and so I am just thrilled to be here talking to such a large group of people, our next generation of scientists who are interested in the field of genomic epidemiology for infectious diseases. I'd just like to remind everybody that the lecture is being distributed under the Creative Commons Attribution-ShareAlike license. That allows you to copy, distribute, and transmit the work, and to adapt the work, so long as you attribute the work to the authors. And if you do alter, transform, or build upon this work and you redistribute it, you must redistribute it under the same Attribution-ShareAlike license. Okay. So here's our module on mobile genetic elements and the environmental microbiome.
I'll be presenting this to you on behalf of Professor Rob Beiko, a long-time collaborator of mine. Unfortunately Rob couldn't be here today to give the lecture, which is a shame, because he is a very talented public speaker and full of all sorts of wisdom that I lack. But fortunately, I have Zach Light and Finlay Maguire on standby here to help me along if I stumble. We have quite a bit of content and not a lot of time to get through it, so I'm going to just dive right in. Here are our learning objectives. By the end of this lecture, hopefully, you will understand how environmental sequencing differs from clinical genomic sequencing; understand how environmental samples are used for pathogen surveillance; know the different types of mobile genetic elements that mediate horizontal gene transfer, or lateral gene transfer; understand the impact of horizontal gene transfer on the analysis of microbiomes and of regular clinical genomes; and know the main bioinformatics tools that are used for pathogen detection and characterization. So, part one. To this point this is largely review for you, but I wanted to go over clinical sequencing versus environmental sequencing one more time and draw out the differences. So let's go through the important consequences of that difference and how we tackle the analysis of environmental samples, especially in the context of antimicrobial resistance and lateral gene transfer. Okay. Let's start with a simple example, and that's SARS-CoV-2. Dr. Simpson has already given us an excellent lecture describing how this is performed, but just for your review: the way that it works is we get a clinical specimen, through something like a nasopharyngeal swab, that is preserved and transferred to a lab. RNA is extracted, converted to cDNA, and sent through a sequencer, which pumps out the genomic sequence information, and then we use that to perform our analysis.
For environmental sequencing, in the context of SARS-CoV-2, this is wastewater sequencing. What happens is that people shed the virus through feces, and that is collected at wastewater treatment facilities and sampled there. So it's sampled as a collection of viruses from a community of people who are contributing virus through wastewater. The analysis of that sample is performed in much the same way. Well, there's a little bit of extra work because it's a different type of sample; you have to concentrate it, et cetera. But ultimately you're extracting the nucleic acid, RNA, from that community of viruses. Then, as I show at the top, you send it through for RT-PCR, and that will give you insight into how much viral load exists in those communities. It's a nice and quick way to see if there may be emergence of a highly transmissible variant that is causing an increase in the viral load. And then the second part is diverted out for whole genome sequencing. Okay, so we know how the sequencing is performed for SARS-CoV-2: through the use of these pools of primers that are designed to tile across the genome. The amplicons that are generated from those PCR reactions are sent into the sequencer, and the reads that come out of the sequencer are mapped to a reference genome. You see that here on the right. And then you look for mutations in the reads relative to the reference. The collection of variants, the collection of mutations, is used to assign that genome sequence to a given lineage and for additional types of analysis. If you're doing environmental sequencing, it's a metagenomic approach, so you have a collection of genomes. In this toy example, I have two genomes of one lineage, the pink lineage, and one genome of the blue lineage. The amplicons will tile across these genomes. The genomes themselves may be fragmented. You may not get perfect coverage.
You certainly will not get perfect coverage across all the genomes. You get fragments of coverage across the collection of genomes. But you take those amplicons, you run them through the sequencer, and you map them to the reference, and now you're getting this collection of different mutations that occur in different proportions. So in this example, we can see a C to G mutation in the pink lineage. It's wild type for the blue lineage, and the two exist in a two-to-one ratio. And in this second example here, we see a C versus a T: the blue, minor circulating lineage has the T, which is wild type, at that position, and the C is from the pink lineage. But again, these exist in this two-to-one proportion. And so we can use that information to help us try to disentangle what exists in there. There's not much disentangling going on, though, just because the data is so fragmentary, and typically of high Ct value, meaning low viral load. But there are two ways to attack the problem. One is to generate a consensus sequence, in the same way that we generate a consensus sequence from clinical genome sequencing. The top arrow here captures those mutations, and we can assign the consensus to a lineage, and then we do that thousands of times, or hundreds of times, or however many times. Remember, this is coming from wastewater surveillance, but it gives us essentially a distribution of the proportions of the lineages that are circulating within that community, or aggregated nationally, or however you want to do it. So what we do here at the National Microbiology Laboratory is take a look and see what is circulating in wastewater at the national level. This is an actual plot of that.
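To make the two-to-one toy example concrete, here is a minimal sketch of how per-position base proportions and a majority consensus could be computed from the reads covering a single reference position. The function names, the 0.5 consensus threshold, and the toy read column are assumptions made for illustration; this is not how any particular pipeline implements it.

```python
from collections import Counter


def allele_fractions(pileup_bases):
    """Return {base: fraction} for the A/C/G/T calls seen at one position."""
    counts = Counter(b.upper() for b in pileup_bases if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return {}
    return {base: n / depth for base, n in counts.items()}


def consensus_base(fractions, min_fraction=0.5):
    """Majority base, or 'N' when no base exceeds min_fraction (assumed cutoff)."""
    if not fractions:
        return "N"
    base, frac = max(fractions.items(), key=lambda kv: kv[1])
    return base if frac > min_fraction else "N"


# Hypothetical column of reads at one site: two pink-lineage genomes carry G,
# one blue-lineage genome carries the wild-type C.
reads_at_site = "G" * 8 + "C" * 4
fractions = allele_fractions(reads_at_site)
print(fractions)                  # G at about 0.67, C at about 0.33
print(consensus_base(fractions))  # the consensus hides the minor C allele
```

The consensus call keeps only the majority base, which is why the second, sub-consensus approach looks at the fractions themselves.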
And the second method is to look below the consensus, at the sub-consensus level, and here there is no presumption that you're going to try and infer different consensus sequences for the different lineages. You take a look at the different mutations that you've found in there, and those may be indicative of a given lineage, or they might just be the types of mutations you're interested in. So in addition to monitoring lineages, we also monitor for mutations. There are certain notorious mutations that have arisen, and while it's a dynamic process that changes over time, we want to look for those and see if those mutations of interest are also increasing in prevalence, especially if they increase rapidly in prevalence. That can give us a sense of what may be going on in terms of whether that mutation, or that collection of mutations, may be associated with, for example, increased transmissibility or increased immune evasiveness. Okay. This is a link, just for your own reference, to the COVID-19 wastewater surveillance dashboards and national dashboards, and this is coming from the RT-qPCR data that was made publicly available. The genome sequence data is not really of the quality where we are comfortable making it available on a government dashboard, so at this point it's basically just used internally and submitted to GISAID for other people to look at if they want. Okay, the pipeline that we use for performing a lot of the wastewater genomic surveillance is called viralrecon, and I'm not going to go into the details of this pipeline. Given that you've seen the lecture from Jared Simpson, you will probably recognize that there are a lot of subcomponents in the pipeline that are very similar: things like the variant callers, SnpEff to see what the functional effect of a variant might be, Pangolin assignment here, Nextclade assignment, et cetera.
So that's kind of a simple example of environmental sequencing as it applies to SARS-CoV-2. For the rest of the lecture I want to focus more on bacteria. The concept is essentially the same, but there are some notable differences. One notable difference is that when you're sequencing a bacterium in a clinical context, you are typically culturing that organism, selecting an isolate, and sequencing the isolate, and then you'll get the whole genome sequence for that isolate. That includes all of the main chromosome, and it can also include things like plasmids and other extrachromosomal elements. And it's not always the same as it is with SARS-CoV-2, where you're looking for differences in mutations; there are actually differences in gene content, certainly within a species, more or less depending on which species you're looking at, but also even within lineages that are circulating, for example within an outbreak. So there's a little bit of variability, but you capture all of that variability when you get that whole genome sequence. When you're doing environmental sequencing, for example if you're sampling a farm or some other habitat, the environmental sampling is going to give you a collection of microorganisms, and there can be substantial diversity in there. You extract the nucleic acid content from that community of microorganisms, so you use a metagenomic approach to do the sequencing. But in this case there's no guarantee, and in fact it's very likely, that you're not going to be able to regenerate an entire genome even for a given bacterium, let alone for all of the diverse bacteria that may exist in that sample.
And so that is a bit of a problem, just sampling bits of it, and the problem becomes especially important when we are trying to perform surveillance of antimicrobial resistance and lateral gene transfer. In the example that I'm showing you here at the bottom, we have a horizontal gene transfer region, so this is a section of the genome that is mobile, and it contains these antimicrobial resistance genes. But because of the random and partial sampling that we're doing for this one segment of this one genome in this community of genomes, there's no guarantee that we're going to be able to sample all of the antimicrobial resistance genes that are contained within that genome for that organism. And even if we do, we're likely only able to grab a fragment of a given gene, and Dr. McArthur talked at length about some of the difficulties in trying to assign antimicrobial resistance determinants from metagenomic data. So these are problems related to the detection of antimicrobial resistance genes, but also, it doesn't give you information about the flanking sequence, which is additionally important, because it's typically this combination of things that we're interested in looking at: not just the antimicrobial resistance gene, but also whether it is harbored by a mobile genetic element. So you don't really get all of that context in the same way that you get it from a clinical sequence, and so this is quite a bit of a problem. And of course it's a problem that we can't really avoid, because of the fundamental importance of environmental sampling for being able to identify the source and the patterns of transmission of antimicrobial resistance from farm to fork. Okay, so we're moving on here to the part on lateral gene transfer.
Okay, and some of the problems that we have. This has been alluded to in prior lectures, but the problem, stated briefly here, is that we have this assumed method of inheritance that is captured in a phylogenetic tree: vertical inheritance, descent with modification. So the genes and the gene content here are acquiring their diversity through this vertical, descent-with-modification type of paradigm. When you have recombination and lateral gene transfer, this violates the assumption that we use in phylogenies of having this bifurcating tree structure. A gene that may exist in one part of the tree, one subclade of a given community of organisms, can be transferred into another subclade. This is a violation of the bifurcating tree structure, so the phylogenetic tree on its own is insufficient for us to be able to capture and investigate these types of recombination and lateral gene transfer events. So we need a method to be able to track that, and it's an interesting problem, an extreme problem, that has lots of important downstream implications. So what are we talking about with lateral gene transfer? Essentially it's a change of address from one organism to another, and this can occur in three ways. One is transduction. That's where you have a bacteriophage that can infect a host cell. It can go through this process called lysogeny, where it integrates itself into the genome, or it can be a lytic phage, where it's replicating independently. Normally what will happen is the phage will just encapsulate its own genomic material and will just transmit its own genomic material, but on occasion, you can have these aberrations.
These can occur where the phage, instead of taking up its own DNA, takes up some of the bacterial DNA instead. That's called generalized transduction. The phage progeny is still viable, but it contains the wrong DNA, and so when it releases from the donor cell it can go into the recipient cell and inject that bacterial DNA, which again is taken up by the bacterium, into the chromosome or into a plasmid. And then there's also specialized transduction, where through recombination a part of the phage genome is replaced with bacterial DNA, and that gets encapsulated inside the phage. Sometimes that makes the phage non-transmissible, but you can have a helper phage along that provides the additional functionality that's missing from that phage and allows it to transmit and infect the recipient cell. The second method is conjugation. I think most people are pretty familiar with conjugation; this is a sexual transmission method that requires contact between the donor and the recipient cell. The lateral gene transfer is typically mediated by a plasmid that has the functionality to replicate itself, to partition itself, and to create the pore or the pilus or other mechanisms that allow for the cytoplasmic transmission of its content from the donor into the recipient. The last type of lateral gene transfer is transformation, where a dead bacterium releases its DNA out into the environment, and certain bacteria, not all of them, but certain bacteria, especially if they are in a certain phase of their life cycle, can become competent, which means that they can take up that DNA and, through a process of recombination, integrate it into their own genome. Okay. So why is it so important?
Lateral gene transfer is really important because the genes that are transferring are typically not the ones that are required for viability; those are the ones that are typically harbored in the main chromosome. Instead, they're the ones that are important for providing a selective advantage in an environmental niche. Antimicrobial resistance is a really good example. Organisms that may sit distantly from each other in the taxonomic tree can acquire that new functionality without having to evolve it through descent with modification. Essentially you have a donor from a distant species that can say, here, take this DNA; it may give you a selective advantage to be able to thrive in a given ecological niche. And so here's an example with this aminoglycoside resistance gene that has been identified as having a common origin because it has this conserved neighborhood of genes. You can see that it's not perfect, but there's a pretty well conserved neighborhood of genes in these different species: Salmonella, Citrobacter, E. coli, and Enterobacter. So all of these are recipients of this gene, without any of them having had to evolve that resistance functionality on its own. Okay. So lateral gene transfer is implicated in the spread of antimicrobial resistance. Here's a good example of it. This is an example of Yersinia pestis, two strains of Yersinia pestis, which is the cause of the bubonic plague, that contain plasmids with two different sets of antimicrobial resistance genes: this one has doxycycline resistance genes, and this one has streptomycin resistance genes. The homology of this plasmid is related to Salmonella. This plasmid was acquired from Salmonella, and this plasmid over here, its genes have similarity.
I shouldn't have said homology, but they have a high degree of similarity to genes that have been found in Acidovorax. So it's not just a problem that you can receive it from one neighbor; you can receive it from multiple different neighbors from across the phylogeny, or across the taxonomy. And that promiscuity is also a big problem. Salmonella and Acidovorax are from different orders, I believe, or different classes actually: Salmonella is a gammaproteobacterium and Acidovorax is a betaproteobacterium. So they belong to the same phylum, which is the Proteobacteria, but they belong to different classes, and that is a pretty long way apart. The ability to transmit these genes across that interspecies gap is quite concerning, because it makes it very difficult to control the acquisition of antimicrobial resistance. Okay. So we discussed mobile genetic elements in kind of a general way, but a mobile genetic element is basically any type of DNA that can move around within the genome. We talked about transduction and conjugation, et cetera, as the main mechanisms, but there are also different types of what we call replicons. These are the different types of mobile genetic elements, and they can include plasmids, transposons, bacteriophage elements, genomic islands, and there are more. When you compile all of the mobile genetic elements that are harbored within the genome, that's referred to as the mobilome. I'm not going to go through and profile all of the characteristics of these different mobile genetic elements, but I do provide them to you at the end of the lecture in the additional slides section, and you will refer to them in the practical. Okay.
So, detecting recombination is important, because it can help us to identify the major modes and vectors of transmission. What I mean by that is: which genes are being shared, how are they being shared, who are they being shared with, and where is that happening, you know, in which types of environments, like in the hospital or in the community. Why is this important? It's because we can use this information to identify the risk factors, so it's important for risk assessments, like which genes are being mobilized and which are not, and that also gives us mechanisms to prioritize our interventions to try to minimize the accumulation of antimicrobial resistance. But it's not easy. Finding evidence for recombination and lateral gene transfer is not easy at all; it's a very difficult problem. We're kind of in the early stages, and maybe I'll just give you the punchline: we're not going to get to the point here where it's a solved problem and you just push a button, but we're making some headway. Anyway, here are some of the clues that we have to try to identify lateral gene transfer. One is that we can find genes whose phylogenetic trees don't make a lot of sense in the species context. In this example here we have a cladogram showing the species tree on the left, and then we have a cladogram showing the gene tree, and they're different. How are they different? If you take a look at nodes five and six in the species tree, you'll see that the most recent ancestor of the ancestor of five and six also harbors node four.
Okay, but in the gene tree, which is also a rooted tree, five and six share a common ancestor with gene one rather than gene four. So these two trees have different topologies, even though they're coming from the same species, just looking at a single gene. Right, so they're different. What is going on there? Why is this happening? It could be because of lateral gene transfer. Another clue is composition. As these mobile elements evolve in different species, which may have different GC content and different codon biases, over time they will have different sequence compositions, and if a donor transfers this divergent composition of DNA into a recipient, it may appear as a difference in, for example, GC content or codon bias. So you can look for these stretches of differential composition, and that may be evidence of lateral gene transfer. Another clue is unexpected sequence similarity. With lateral gene transfer, essentially what's happening is you're inserting a piece of DNA from the donor into the recipient. So in this example here on the left, we have the DNA of our donor cell in blue squares, the DNA of the recipient cell here in these red spheres, and then the transformed cell has a piece of donor DNA placed in between the genes of the recipient. In homologous recombination, it's not an insertion, it's a replacement: you are replacing one gene with another one. And so if you have two genes that are similar but slightly divergent, in one genome but not in the other genome, that may be evidence of homologous recombination. In this example, this green sphere here has replaced this red sphere here.
And the flanking genes here remain the same as in the recipient cell, so we take a look at this and say, hey, the ancestor looked like this, this descendant looks like this, and this green sphere didn't come from the ancestor; it must have come from somewhere else. Okay, so another clue is this kind of guilt by association. If the gene that you're interested in is localized next to other genes that are implicated in mobile genetic elements, then that is a good clue that you have a mobile genetic element. So here we have the tet(M) gene, here some accessory gene, but it's flanked by these genes for conjugative transfer, genes for excisionases and integrases. So this gives us the sense that this antimicrobial resistance gene may have been transferred and is mobilizable, because it's harbored within a larger segment of genes that have the characteristics of a mobile genetic element. All right. Okay, so how do we find these things? We're going to talk a little bit more about some of the programs that we can use to try and find mobile genetic elements. And how do these programs work? Well, they work with these kinds of general clues, because there are a lot of signals that they can look for relative to the rest of the genome. We looked at that: different nucleotide compositions. Some genes can be signatures for a mobile genetic element, and we looked at that too: replicases on plasmids, transposases, integrases, excisionases, and there's a whole bunch more, like virulence factors, AMR genes, and secretion system genes. These are all indicators of possible mobile genetic elements, and you can use this contextual information.
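As a rough illustration of the nucleotide composition clue, the sketch below slides a window along a sequence and flags windows whose GC content is far from the average of all windows. The window size, step, z-score cutoff, and the toy sequence are all assumptions chosen for illustration; the programs discussed in this lecture use more sophisticated signals than this.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G or C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def atypical_windows(genome, window=1000, step=500, z_cutoff=2.0):
    """Return (start, gc) for windows whose GC deviates from the mean window GC.

    window, step and z_cutoff are illustrative defaults, not tuned values.
    """
    starts = range(0, max(len(genome) - window, 0) + 1, step)
    values = [(s, gc_fraction(genome[s:s + window])) for s in starts]
    gcs = [gc for _, gc in values]
    mu, sigma = mean(gcs), pstdev(gcs)
    if sigma == 0:
        return []  # uniform composition, nothing stands out
    return [(s, gc) for s, gc in values if abs(gc - mu) / sigma >= z_cutoff]


# Toy "genome": a low-GC host background with a GC-rich stretch in the middle.
host = "ATATGC" * 500    # 3000 bp, GC = 1/3
island = "GCGCGC" * 100  # 600 bp, GC = 1.0
toy_genome = host + island + host
for start, gc in atypical_windows(toy_genome, window=300, step=300):
    print(start, round(gc, 2))
```

A flagged window is only a hint. Composition can differ for reasons other than lateral gene transfer, which is why this signal is usually combined with the presence of mobility genes.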
You can also do some searches against your reference databases to see if the genetic element that you have is implicated in a previously described, recorded mobile genetic element, using things like BLAST and DIAMOND, or you can use things like Mash. Okay. All right, I'm not going to rehash RGI; it's already been covered. But RGI is a very good program, an excellent program; I consider it to be a first-in-class program for identifying antimicrobial resistance genes. MOB-suite is a collection of programs that was actually developed at the National Microbiology Laboratory by James Robertson. This is a program for identifying plasmids and also for helping to characterize and cluster plasmids, so it's a suite of programs. First, MOB-cluster. It comes with an existing high-quality reference plasmid database that has been clustered together, I think using Mash, but the focus is on gram-negative enterics, because that's what we have access to. In order to build that reference plasmid database, the isolates that they had access to were from our enterics program, mostly Salmonella, I believe. They did a lot of assembly, both short-read and long-read assembly, of a large number of Salmonella genomes in order to be able to identify, assemble, and annotate these high-quality databases. But you can also cluster your own set, so if you've done the same thing and you have a database of plasmid sequences, then you can use MOB-cluster to cluster them together into similar plasmids. MOB-recon can help to find and classify the plasmids in your data sets. Remember, when you do an assembly, you can't guarantee that your entire plasmid will be captured in a single contig, so this will go and look for them.
Well, it will look for contigs that are circular, and if it can find one, that's a candidate plasmid, because it's very rare that you're going to be able to get an entire circular chromosome in a single contig. It's more likely that you'll get a whole plasmid, especially if you're using long-read sequencing technologies like Oxford Nanopore. It will also look for the characteristic plasmid genes, like the relaxase or replicon genes, that make a contig a candidate plasmid, and compare it to that high-quality reference plasmid database that has been generated. So long as you're working with these gram-negative enterics, you have a good chance of being able to find it. And then there's MOB-typer, which looks at the relaxase and other information to try and figure out whether your plasmid falls into one of three major classes. Conjugative means that it has all of the facilities to replicate the plasmid itself and to transfer itself into a recipient; all those genes are located right on the plasmid, so the pilus formation, for example. Mobilizable means it's not fully self-transmissible, but if those genes exist on the main plasmid, the conjugative one, then it can basically be mobilized: you still get the mate-pair formation, the pore and the pilus formation, and the plasmid can take it the rest of the way. And then there's non-mobilizable, where it can replicate and it can descend into daughter progeny through vertical inheritance, but it may not be transmissible through lateral gene transfer. So, genomic islands. I consider the Island* group of programs to be, in my opinion, best in class.
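Before moving on to genomic islands, the three plasmid mobility classes just described can be summarized as a small decision rule. This is a simplified sketch based only on the description above (relaxase plus mate-pair formation genes); the function and parameter names are hypothetical, and it should not be read as MOB-typer's actual implementation.

```python
def predict_mobility(has_relaxase: bool, has_mate_pair_formation: bool) -> str:
    """Classify a plasmid from two marker-gene observations (illustrative only).

    has_relaxase: a relaxase gene was found on the plasmid.
    has_mate_pair_formation: pore/pilus (mate-pair formation) genes were found.
    """
    if has_relaxase and has_mate_pair_formation:
        # Carries everything needed to transfer itself.
        return "conjugative"
    if has_relaxase:
        # Can be moved, but relies on transfer machinery encoded elsewhere.
        return "mobilizable"
    # Replicates and is inherited vertically, with no evidence it can transfer.
    return "non-mobilizable"


for relaxase, mpf in [(True, True), (True, False), (False, False)]:
    print(relaxase, mpf, predict_mobility(relaxase, mpf))
```

In practice the marker genes would themselves come from a search of the plasmid sequence against reference databases, which is the hard part that this sketch leaves out.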
It is developed in Professor Fiona Brinkman's lab and consists of a few different modules. IslandPath will look for unusual patterns of nucleotide composition; it uses hidden Markov models, and it also looks for the presence of certain genes such as mobility genes. IslandViewer will let you view genomic islands that have previously been predicted in large databases of genomes. So once you have an island predicted by IslandPath, you can view it in IslandViewer. It also will identify the AMR genes and virulence factors that may be contained inside of these genomic islands. And then there's IslandCompare, which compares the previously identified genomic islands across genomes. Okay. So VIBRANT is a prophage finder, and it's one of dozens of very good prophage finders, but it's the one that's going to be used in one of the programs that we're going to be profiling in the lab, so it's worth spending a minute or two on it. This is a program that uses a combination of protein similarity and a neural network to predict the viral content of a genome from environmental data. If you have a prophage in a genome, but you've sequenced it using a metagenomic approach, then VIBRANT is a good program to identify that for you. What it does is use these reference databases, which are databases of orthologous groups and gene families: the KEGG KO families, which are the KEGG orthologous gene families; Pfam families as gene families; and VOG, the Virus Orthologous Groups, which is another database. These provide hidden Markov models that can detect whether your gene falls into one of these orthologous groups.
And so it will take that info and use it to annotate your assembled metagenomic data and assign it as a possible candidate viral genome. Then it uses a bunch of these V-scores, which are based on the annotations, and feeds that into the neural network, and that generates the final prediction. So you have your assembled metagenome, protein scaffolds and nucleotide scaffolds; you're just assembling the data. And if those scaffolds, those contigs, are too small, it will remove them. Then in the next step it goes through and annotates them with these hidden Markov models from KEGG, Pfam, and VOG. And what we mean by too small is that if there are not enough genes on that contig, then it will remove it and say these are likely not viral contigs. It takes the remaining ones, so that they're composed largely of viral-associated genes, and puts them through a neural network, and if the score is too low it will remove those. The ones that remain are the ones that are predicted to be viral in origin. And at the bottom here you have this PCA plot that's showing you how well VIBRANT does at predicting the viral content. So the green here are the ones that have been assigned as virus, right? Plasmids are in red, and then there's bacterial. You can see it's a pretty good discrimination: a bit of a mix here between the plasmid and viral, but pretty good discrimination between the bacterial chromosome contigs and viral contigs, and okay for plasmids and viruses. And just at the bottom of this list here is a prophage prediction comparison program. It takes a look at, I think it was maybe a dozen different programs, and it will perform the predictions and allow you to compare them, if that's what you're into. So, I couldn't help but insinuate some of my own content into Rob's talk here.
And because I wanted to profile this with you in the lab section, here I'll give you a brief introduction to Proksee. It was developed as a joint collaboration between my lab and the lab of Professor Paul Stothard at the University of Alberta. So this is not for environmental genomes per se; it's designed more for full genomes, but it will also take fragments of genomes, so it can be very useful. The idea here is that it's a genome assembly, annotation, and visualization system designed to be easy to use. You don't need to have a lot of background understanding to be able to use it. One of the reasons why we wanted to develop this is because the barrier to generating bacterial genome sequences is very low. Any lab can now routinely generate oodles of bacterial genome sequence data. Where they sometimes find themselves lacking is in the expertise to actually assemble it, to annotate it, and to integrate different types of programs to analyze and then visualize the genomes that they generate. And so that's what Proksee is for. Proksee will assemble your genomes from raw reads, if that's what you want, but it will also accept assembled contigs, and it will also accept annotated contigs. The assembly process will assemble a genome in a fast way using a program called SKESA, and it uses that to generate these QC metrics. It also uses Mash to assign the assembled genome sequence to its most likely species. So if it came from Listeria, it will say, I'm assigning this as Listeria, and it will compare it to other assemblies for Listeria and say, well, how well did you do? Right, so it's looking at these assembly metrics, N50, L50, number of contigs, and also length, and it will place yours in the context of these other assemblies.
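As a rough illustration of the assembly metrics just mentioned, here is a minimal sketch of how N50, L50, number of contigs, and total length can be computed from a list of contig lengths. The function name `assembly_metrics` and the toy lengths are assumptions made for this example; this is not Proksee's own code.

```python
def assembly_metrics(contig_lengths):
    """Return basic assembly metrics for a list of contig lengths (in bp).

    N50 is the length of the contig at which the running total, going from
    the longest contig down, first reaches half of the assembly length.
    L50 is the number of contigs needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    l50 = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running * 2 >= total:
            n50, l50 = length, count
            break
    return {
        "num_contigs": len(lengths),
        "total_length": total,
        "n50": n50,
        "l50": l50,
    }


# Hypothetical toy assembly with five contigs.
metrics = assembly_metrics([100, 200, 300, 400, 500])
print(metrics)
```

A larger N50 together with a smaller L50 and fewer contigs generally points to a more contiguous assembly, which is why these numbers are useful when placed next to other assemblies of the same species.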
And so that way you can say, okay, this looks like a pretty good assembly. I'm pretty happy with what I have, and I can proceed now with my annotation and analysis. There are a couple more metrics here that you can use, although I'm not going to go into the details of them, and we're not going to look at the assembly in the lab section, just because of the time required to compute the assembly. So we're going to look at pre-assembled genomes, but I just wanted to let you know that it is available there for you. It reduces the barrier for you to generate and analyze your own sequences. And then you have the rendering program here, which will render the annotations on your genome in a circular or a linear view, but the default is a circular view. You have a backbone here in the middle, and then you have these tracks that radiate concentrically out from that backbone, and that's where you can place annotations, and you can place annotations from different types of annotation programs. We have a bunch of different annotation programs here, and also the ability to upload your own annotations if you have them from external programs that you may have used to analyze your data. Now, the nice part about Proksee is that it is extensible, which means it's very easy for us to add new programs to it. The programs that are listed here, well, Proksee is in beta right now; the paper will come out in the summer. But right now we consider it to be in beta, so I have an initial list of programs here, and these may evolve over time once we get some better programs. Bakta, for example: we'll probably use that to replace Prokka, since Bakta is considered to be the successor to Prokka.
But one of the considerations for implementing these annotation pipelines is just how well constructed those pipelines are. Some are so poorly constructed that they're essentially uninstallable, or they fail. They're not very robust. And they aren't maintained; they're abandoned. So there's a huge number of different programs that are available to use, but there's a small number of programs that are of the quality that allows you to implement them in a system like this. I'll talk about that a little bit more, actually, in a couple of slides. Anyway, I just wanted to wrap up this section with a summary of some of the caveats for these mobile genetic elements. Mobile genetic elements are highly mutable, and the gene content and order can change. That's why it's nice to be able to view them in a genomic context, like a circular context. Some databases are better for some organisms, like the Gram-negative enterics, for which there's a whole lot of sequencing done, and less so for others, like Gram-positives, so the comparative methods may be better for some organisms than they are for others. Short-read assembly tends to fail spectacularly for mobile genetic elements. It's difficult to get these beautiful, finished, circular plasmids when you're using these short-read technologies; longer-read technologies are more helpful for getting these larger mobile genetic elements like plasmids. And predicting the boundaries of things like genomic islands and other mobile genetic elements is also a problem. I'll look at that in a second. How are we doing for time? Seven minutes left? Seven minutes, perfect. Okay, so I wanted to return a little bit to my rant about software.
There's a lot of software being developed, and what matters is its ability to be implemented, and you guys have seen this a little bit before. So if your program has a million very idiosyncratic and bespoke kinds of dependencies, it can make it very hard to install and maintain. And so there are these package managers like Conda that will resolve the dependencies for you. So if you generate software, it's nice if you can put it into a Conda package. Then there are these virtual environments like Docker. Docker makes it easy for you to run these packages inside of, say, a high-performance computing environment without requiring you to have root-level privileges. But in order to use it, Docker needs to be installed on your high-performance computing system, which I don't think is the case for all of them. There are a lot of high-performance computing environments; I think Compute Canada, for example, doesn't permit Docker. Singularity is another type. It's similar to Docker: it's a virtual environment where you can install and run your programs. But I think instead of trying to partition everything off into its own image like Docker does, it tries to integrate a little bit more with the host. Also, I think you can run stuff without requiring sudo. Yes, you can. I'm not an expert in all of these different package managers. I'm not as familiar with all the advantages and disadvantages, I have to concede. But I'm sure that some people in the class are, and our TAs are, so if you have additional insight that you would like to provide to the students on Conda, Docker, or Singularity, then I would encourage you to add that to the Slack channel. And then implementation inside of a workflow execution manager like Nextflow is also important, especially if you want to have your program executed in a high-performance computing environment.
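Since whether Docker or Singularity is installed differs from one computing environment to the next, a quick check before you try to run a pipeline can save some frustration. What follows is only a minimal sketch: the command names in `CANDIDATE_TOOLS` and the order of preference in `pick_container_runtime` are assumptions for illustration, not a recommendation from any particular cluster.

```python
import shutil

# Assumed command names for a typical install; adjust these for your own site.
CANDIDATE_TOOLS = ["conda", "docker", "singularity", "apptainer", "nextflow"]


def find_tools(names=CANDIDATE_TOOLS, which=shutil.which):
    """Map each tool name to its path on PATH, or None if it is not found."""
    return {name: which(name) for name in names}


def pick_container_runtime(found):
    """Return the first available container runtime, or None.

    The order below (Singularity-style runtimes before Docker) is an
    assumption made for this sketch.
    """
    for name in ("singularity", "apptainer", "docker"):
        if found.get(name):
            return name
    return None


found = find_tools()
for name, path in found.items():
    print(f"{name}: {path or 'not found'}")
print("container runtime to use:", pick_container_runtime(found))
```

If nothing is found, that is usually the point at which to ask the system administrators what is supported, rather than trying to install a container runtime yourself.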
And here's where we get to this program that Finlay wrote called Antimicrobial Resistance: Emergence, Transmission, and Ecology, or ARETE for short. This is a kind of kitchen-sink program that is trying to tackle this huge problem of inferring patterns of transmission and distribution in these environmental samples, or in both environmental samples and clinical samples if you have them, but also looking at how those distribution patterns occur in genes and mobile genetic elements. And so it's looking for whether there's significant evidence of lateral gene transfer and recombination, and how you associate these transmission events with, for example, different habitats or geographic proximity. And how is genetic relatedness related to surrounding antimicrobial usage? Here's how it looks. Well, I don't know if this is going to be that brief, but here's an overview of it. It can take your raw sequence reads and assemble them with Unicycler, this is optional, and do some QC for you, or it can take your pre-assembled genomes. And it can perform a specialized annotation for you using RGI; MOB-suite, which we covered; IslandPath, which we covered; VIBRANT, which we covered; and Bakta, which is the successor to Prokka. And it also uses these databases like the virulence finder database, is that right, Finlay? VFDB. Yep. Yeah, BacMet, which is a database of virulence factor, sorry, no. Virulence Factor Database, thank you. BacMet, which is a database of bacterial genes that show resistance, like metal resistance, so they're implicated in antibacterial biocide and disinfectant resistance. Yep. And biocidal things in general. We see a lot of co-selection on plasmids of heavy metal resistance with antibiotic resistance. Okay, CAZy, which is, oh my gosh, so that's the Carbohydrate-Active Enzymes database.
And so an organism that has a profile for metabolizing different carbohydrates can give you clues to the environment that it may be viable in. And then ICEberg: ICEberg is a database of integrative and conjugative elements. It's a type of mobile genetic element that is profiled briefly in the additional slides. Okay, so once you have your annotation, it will assess the pan-genome using Panaroo, and it can do lineage inference using a fast clustering algorithm. It will look at gene order for you. It will also generate a phylogeny for you; it uses FastTree and IQ-TREE to do that. And then at the end there's lateral gene transfer, coevolution, and recombination. It's trying to infer these patterns of lateral gene transfer, and it does that using programs like rSPR, which is a rooted subtree prune and regraft algorithm; I won't get into the details of it. EvolCCM is, oh God, the community coevolution model. That takes a look at pairs of genes, or pairs of mobile genetic elements, or genes and mobile genetic elements, and tries to figure out if they are co-segregating through the tree. Like, are they associated in the tree, or do they have a presence or absence that is kind of random throughout the tree? Is that about right for EvolCCM? Yeah, exactly, it looks for correlations between the phylogeny and presence-absence patterns of genes. Right, so this basically gives you an association: if you see this one, you also see this other one. And then Gubbins is for whole-genome-sequence-based detection of recombination. All right, here's an example using ARETE, or a prototype of ARETE, to investigate antimicrobial resistance in a group of about 1,300 Enterococcus faecium genomes.
So we're looking at these different components, like AMR, genomic islands, phage, plasmids, novel plasmids, which are the ones that were detected by MOB-suite, and virulence factors. And then these colors are showing you these different types of environments, like clinical. Oh, and these are also from two disparate geographic locations, the UK and Alberta. What we're trying to look at, well, the counts per sample, basically, are the associations that we have for these different types of mobile genetic elements, or other features like virulence factors, with different environments: the clinical environment; wastewater, and this is downstream wastewater from near hospitals; agricultural, so out on the farm; natural water environments; and agricultural wastewater. And so one of the things that's kind of interesting to see here is for AMR. The pink and red are the clinical UK and Alberta. AMR has a higher count in clinical samples and in wastewater samples than it does for other environments, and so that's kind of interesting to see. And there's this hypothesis that once an AMR gene makes its way into the human population and is distributing throughout it, so it's in a clinical environment, then it will adapt to that host. So even when it has been transferred zoonotically or environmentally, once it gets into that new host it can undergo these selective sweeps that will diversify it and create a separation between the ancestors and the ones that are circulating in that population.
And it also implies that they become so adapted to that environment, so to people, for example, that they can't really transmit back into other environments, like the agricultural environment, or get back to streams or onto other farms, back into the environment. It basically just circulates within humans, which is an interesting finding.
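The counts-per-sample comparison across environments described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `mean_feature_counts`, the environment labels, and the numbers in `toy_samples` are all made up for the example and are not results from the Enterococcus dataset or output from ARETE.

```python
from collections import defaultdict


def mean_feature_counts(samples):
    """Average the number of annotated features per genome for each environment.

    `samples` is an iterable of (environment, {feature_type: count}) pairs,
    one pair per genome. A feature type missing from a genome counts as zero.
    """
    totals = defaultdict(lambda: defaultdict(int))
    genomes_per_env = defaultdict(int)
    for environment, counts in samples:
        genomes_per_env[environment] += 1
        for feature_type, count in counts.items():
            totals[environment][feature_type] += count
    return {
        environment: {
            feature_type: total / genomes_per_env[environment]
            for feature_type, total in features.items()
        }
        for environment, features in totals.items()
    }


# Hypothetical toy data: two clinical genomes and two agricultural genomes.
toy_samples = [
    ("clinical", {"AMR": 4, "plasmid": 2}),
    ("clinical", {"AMR": 6, "plasmid": 1}),
    ("agricultural", {"AMR": 1}),
    ("agricultural", {"AMR": 1, "plasmid": 1}),
]
summary = mean_feature_counts(toy_samples)
for environment, features in summary.items():
    print(environment, features)
```

A real comparison would also need to account for uneven sampling across environments and for the relatedness of the genomes, which this sketch does not attempt.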