Okay. Welcome back from lunch, everyone. I'm getting thumbs up that you can hear me. Good. Very enthusiastic this afternoon. Nobody's overwhelmed yet, right? Ready to dive into MAGs. All right, so my name is Laura Sycuro. I'm an associate professor here at the University of Calgary, and of course I'm a microbiome scientist. This is my first time teaching for CBW, but I'm really excited to be here. So I'll tell you a little bit more about my lab, if I can get the slides to go. Like I said, I run a research lab. We work on a couple of different body sites, mainly the human vagina and the human gut. I'm also affiliated with both IMPACTT and the International Microbiome Centre. At the International Microbiome Centre, I lead the microbial genomics core as well as the bioinformatics, or data science, core. These are fee-for-service core facilities, the kind of little boutique core facilities that help enable microbiome science across Canada and at the University of Calgary. We basically handle people's microbiome samples from sample receipt all the way through data analysis and publication-quality figures. We focus mostly on sequencing-based technologies in the core that I operate, so we have a lot of experience with a lot of different sample types, and a lot of perspective from all that, not just our own research. And then for IMPACTT, I lead the functional omics platform. So we are also really trying to make sure that appropriate standards and controls are utilized in microbiome science. We're thinking about our methods from that kind of quality mindset, so some of that will percolate into the lectures I give for the course.
So in my program, I work on the adolescent vaginal microbiome. It's kind of interesting that it was mentioned at the beginning that we don't know much about the adolescent microbiome. That's certainly true for the gut, but we're learning that in the vagina it's actually really interesting: there's a developmentally unique process of colonization and change that happens in the vaginal microbiome during adolescence and young adulthood. Some of that will percolate into the end of the talk, as well as into the exercises, because you're going to use some of our datasets from a cohort of Kenyan girls that we have been studying. And within that, we really want to understand what's driving the communities in the vagina to have a dominant taxon. It's the only body site in the human microbiome where there's a very dominant species, 90 to 95-plus percent of the community, and specifically one species of Lactobacillus, and we're trying to understand how that happens and why it happens in this one body site. We also try to understand how the vaginal microbiome affects risk of STIs. We're also interested in preterm birth, where we're working on novel virulence mechanisms; I mentioned we're very interdisciplinary, and we do a lot of laboratory culture work on this part of our projects. And then we've been developing a program on the gut-brain connection, with disease and neurodevelopment, with a number of collaborators, and we use metagenomics in all of these areas. So, again, a lot of perspective on using these approaches in very different body sites, different types of samples, different types of questions. I also have kind of a passion for teaching bioinformatics to non-coders, maybe because I came to it that way. I don't have a degree in computer science; I came with classical microbiology training, genetics, statistics, and then I was like, I want to do bioinformatics too.
So, I actually teach a course here at the Cumming School of Medicine at the University of Calgary on bioinformatics for non-coders. We call it bioinformatics resources, because there are so many resources out there now in cloud-based platforms and web-based platforms. It's definitely a lot lower throughput, and you can't do everything as fast as you would if you had a cluster that you're working on yourself, but I have these little symbols throughout my talk where I highlight how you can access different tools in different ways, because not everyone's going to come away from here and do all of this work at the command line. I can tell you from my own experience learning it as a postdoc, and from having trained many students and postdocs: if you're not committed to doing it every day for six months, it's going to be really hard to learn. You need that level of commitment. I do have some learning goals in here that you can circle back to as you review the lecture, since this is posted online for posterity. And so we'll just come back to the beginning to refresh our minds about why we're here. These complex microbes, right? Found everywhere, perhaps even on Mars, in all different kinds of habitats; they're just everywhere. We need to learn so much about them, and yet there are so many of them, with a biomass on Earth equal to that of plants. Only about 1% has maybe been cultured, and that's an old number from the 80s; I can't really find any updated numbers out there beyond about 1%. So we still have a real gap in being able to understand all this biodiversity, to be able to discover new microorganisms and the genes or proteins they encode. Without metagenomics, we really can't; we are reliant on this technology, and I think throughout the talk today you'll see how it's taken over even how we approach genomics from a microorganism standpoint. So, I wasn't able to join this morning, unfortunately; I had something come up last minute.
And hopefully I'm not blowing a big hole in Morgan's nice talk, but this is the reality: there are always blind spots in everything that we do in this realm of science, particularly in metagenomics. When you've thought about and learned about these short-read techniques, the reality is that they're 100% reliant on reference databases. If you're going to detect an element of sequence, it needs to be in a database for you to see it. Otherwise, you won't see it; it may be there, but you won't see it. And this is a long-standing problem in metagenomics; we actually call it dark matter, the things in there that you don't see. So what we're going to talk about this afternoon is how we can maybe get around that a little bit by using de novo assembly as a way to see and detect sequences without being 100% reliant on a reference database. I think the thing that's important to realize is that until recently these databases only contained information from that 1% of cultured microorganisms, but now that we are doing the process I'm going to tell you about today, where we're sequencing metagenomes and binning out these metagenome-assembled genomes, or MAGs, those are starting to populate databases. It's just starting to happen, and it's not always super easy to tell whether a database contains just cultured isolate genomes, or cultured isolate genomes and MAGs. But that's starting to come together, and I think it will be routine in the next couple of years that all MAGs that are good enough quality and all isolate genomes that are good enough quality will be together in databases, and that's how we're going to grow our understanding of microbiology. So, working with these assemblies, you'll be able to detect and study novel components of your metagenome in greater detail.
It's not that you can't tell at all that they're there sometimes, but you really can't resolve their taxonomic assignment to a fine level, and you really can't understand what new genes or proteins they may encode, without doing assembly. If we resolve these MAGs, then we can come to a genome-resolved understanding of these novel clades; we can really start to see how they fit into the tree of life. And then we can come to an allele-resolved understanding of microbial function. I think this is huge. If you think about human genomics and cancer, everything is at allele resolution, right? SNPs, and which alleles you have; that's how we understand human disease. So why are we ignoring that when it comes to microbes? It's not just what genome or bug you have; it's what allele you have. It's just that we're not there yet. We are having a hard enough time drilling down into what bugs you have and what genomes they have. But we're getting there, we're just starting to get there, and I'll give you some examples today of how it's so important; we're seeing it's really critical for understanding things like adaptation and exchange. There have been some really exciting things happening in metagenomics in this area, so I'm going to highlight a few of them at the end. All right, so remember all those pretty pictures, all those microbes: they're multi-kingdom, right? Let's come back to this for a second; we have to think about it a little when we're talking about MAGs and assembly. We have a multi-kingdom operation, with all of these different components at different abundances; this is the current understanding of what we have in humans. Does assembly and binning benefit studies of all kingdoms? If you're interested in fungi or viruses, does this technique work for you? Well, the answer is a little complicated. It depends. It depends how you prep your sample, and it depends how abundant these things are in that sample.
You can enrich for certain types of organisms in your sample prep and bring them up in abundance, but other times things are just too low in abundance for this method to really be beneficial, because you need to have adequate coverage of the genome. And this is a concept that will come up throughout the talk and throughout the workshop: coverage. You can think of it like this: you have all of the genomic content in your sample, coming from all these different organisms, and you're going to throw a blanket over it; you're going to cover it all up. That's one X. How many layers of blankets do you need to cover up all of the genomic content in your sample before this will work? The answer is you need to cover it 10 to 50 times. So if you have stuff in there where you don't have enough reads to cover all of that content 10 to 50 times, basically you're not going to be able to detect it using assembly and binning approaches. The other thing you have to think about is genome size, because coverage of different organisms in a mixed sample is affected by the size of those genomes. This slide shows you the range of genome sizes for what we could consider different groups of microorganisms. You can imagine that if you're studying microbial eukaryotes, you're going to have a really hard time getting enough coverage unless they're really abundant in your sample. So these things interact. Essentially, if you have a larger genome, you're going to need more reads to get that coverage, and if you have a less abundant organism, you're going to need more reads to get that coverage. You could end up in a situation where you could make it work for everything, but in practicality, do you have that much money to get all those reads? So there are limitations with this method, and where it's really shone is with bacterial genomes; they're in that sweet spot of size and abundance where this works really well.
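To make the blanket analogy concrete, here's a rough back-of-the-envelope calculation in Python. This is just a sketch: the function name is made up, and it assumes reads are sampled in proportion to each organism's share of the total DNA, which ignores some real-world biases.

```python
def reads_needed(genome_size_bp, target_coverage, read_length_bp, relative_abundance):
    """Rough estimate of total metagenomic reads needed so that one
    community member reaches a target depth of coverage.

    Coverage = (reads from organism) * read_length / genome_size, and
    reads from the organism = total reads * its relative abundance.
    """
    reads_from_organism = genome_size_bp * target_coverage / read_length_bp
    return int(reads_from_organism / relative_abundance)

# A 5 Mb bacterium at 1% relative abundance, 150 bp reads, 30x target depth:
print(reads_needed(5_000_000, 30, 150, 0.01))  # 100000000 (100 million reads)
```

Notice how quickly the numbers blow up: a large eukaryotic genome at low abundance pushes the read requirement, and the cost, far beyond most budgets, which is the limitation described above.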
But that's not to say that the tide isn't starting to change. We're going to talk in just a second about how sequencing is getting cheaper and cheaper, and with that, we're seeing these enormous metagenomics projects: 300 billion reads. Okay, that's all the reads they had to work with. The metagenomic samples that we've given you in your workshop today have a couple million reads each, for example. A couple million. So this study worked with many orders of magnitude more reads, but they did succeed, and this is a bunch of eukaryotes. It's pretty cool, and it's pretty new. So it's not impossible, and maybe we're getting there, but for the most part this has been applied to bacteria, and that's what we're going to talk about today. So I mentioned that sequencing, of course, is getting cheaper. We're all really being driven by this goal of the cheap human genome, and as microbial genomicists and metagenomicists, we're sort of just being dragged along with it. So you may have seen this figure before, where this line, Moore's law (let's see if I can get the laser pointer up here), basically shows the cost of computer processing power decreasing over time as computer processors get faster. And then when we invented the technologies leading to high-throughput parallel Illumina sequencing, we really dipped below that; sequencing became a lot cheaper. We've kind of matched Moore's law since then, and we've gotten to the $1,000 human genome, and a lot of people think we're on the precipice of making the next big drop below Moore's law with this new $100 genome technology. So we'll see how that goes; we're being strung along, and hopefully we eventually get to the $10 human genome. And so how does this benefit us as microbiologists?
I think one thing that's important to realize is that the human genome is way bigger than the standard microbial genome, which means we could really be sequencing enormous numbers of microbial genomes for the same cost. If we're at a thousand dollars for the human genome, we're at about a hundred bucks for a bacterial genome, and so the cost per megabase is still way, way higher. This is a limitation for us, and it really comes from the fact that while the sequencing itself is getting cheaper and cheaper, the molecular biology kits, the tools for getting the DNA out and making the libraries, are still quite expensive, and we have to make one of those for every single library, for every single bacterial genome. And this is why metagenomics is so powerful: while we could be sequencing 6,600 bacterial genomes every single week on an Illumina instrument, we're not. It costs too much; we can't culture all of these and get all these libraries made. The HMP, the Human Microbiome Project, wrapped up about 7-8 years ago now. They had this goal of culturing and making genomes for all of these human microorganisms, a huge goal, and I was part of one of the labs that had a culturomics project for the vaginal microbiome funded by the HMP. When they concluded, this is how many new genomes they had (and these were new, not just ones we'd seen before): 3,000 brand-new genomes, brand-new organisms. Right? They've continued to expand and elaborate on that in the years since, but still, if you look at databases today, this is what's in them. When you look at cultured isolate genomes, yeah, there's a million, but consider the prokaryotes. Look at what's important for human health: a taxon like what used to be Bacteroides dorei, and now has a genus name I don't like to say, has 114 genomes total. I'm doing comparative genomics with vaginal bugs all the time where I have one or two genomes for these really important species.
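The cost-per-megabase point can be checked with quick arithmetic. A sketch in Python, using the round numbers from the talk plus genome sizes I'm assuming (roughly 3.1 Gb for human, roughly 5 Mb for a typical bacterium):

```python
# Assumed round numbers: $1,000 for a ~3,100 Mb human genome,
# $100 for a ~5 Mb bacterial genome (library prep dominates the latter).
human_cost, human_mb = 1000, 3100
bact_cost, bact_mb = 100, 5

print(round(human_cost / human_mb, 2))  # 0.32 dollars per megabase
print(round(bact_cost / bact_mb, 2))    # 20.0 dollars per megabase
```

So a single bacterial genome costs on the order of 60 times more per megabase, which is exactly why pooling many genomes into one metagenomic library is so attractive.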
So we just still don't have the technology and the knowledge, and we can't go about this the slow way anymore; we have to embrace MAGs, and we are. It's becoming the way of the future. So I'm going to tell you today about how we make these MAGs, and continue to give you an impression of how important they are. There are three main steps that we're going to talk through, and that you'll then work through in your tutorial: assembly, binning, and annotation. This is computationally intensive. In the tutorial later, there are very few parts you can actually execute; most of it is going to be looking at files that we generated on the scientific compute cluster for you. It really can't be done, not really any of it, on your own computer; you have to have a compute cluster to do this kind of work. The other thing about it is that while you can use wrappers and build pipelines that kind of automate this, and we've been doing that, as have many others, it does require quality assessment and human intervention within each of the three steps, and every new set of samples can throw new challenges at you, because of these dynamics of how many reads you have, how diverse the samples are, what your coverages are, things like that. So it requires some intervention. Where we're at right now is that steps one and two, assembly and binning, are relatively standardized. There are multiple tools to accomplish every step, and they're not that different from one another at this point; we by and large have agreement on what works. But then we get to step three, and it's really dependent on the system you're working in and the questions you have. That's where we're going to throw more options at you and try to get you to think about what you need to be considering with the choices you make for your questions, particularly with step three.
So that's the simplified workflow that we were able to come up with for the tutorial, to give you a sense of the entire process, a sense of the steps and the coding that go into it. I'll show you the complicated workflow at the end; trust me, this one is simple. We tried, but the reality is there are a lot of steps in this process. Okay, let's get going. So what have we done so far? Recall, of course, that we've done all this stuff in the wet lab: we've gotten some samples with complex DNA from different organisms, we've extracted that DNA, we've made a library, and we've gone on to sequence it. I'm mentioning this again now because assembly, the first part we're going to do here, is perhaps a little more dependent on what happened with your library preparation than some of the other methods in metagenomics. So, just to remind you: when you make a library, you take your DNA and you fragment it. You can do that in different ways, with sound waves or with enzymes. But the size of those fragments affects the assembly in relation to the reads they generate. The other thing that affects assembly is the technology you used to generate your reads. Did you do next-generation sequencing, which we often just refer to as short-read sequencing? Did you do third-generation sequencing, which we often refer to as long-read sequencing? Or did you do both? So what kinds of reads are best for assembly? This is the only slide I'm going to show you that really puts together these two very complicated processes and technologies that have been developed. There are multiple technologies and platforms for each type of sequencing. At the very end of the lecture there are links to videos that show you, through animations, how these different technologies work, if you want to learn more about that. You have some resources.
But basically, you have next-generation sequencing, most commonly Illumina, or long-read sequencing, which is most commonly either Oxford Nanopore or Pacific Biosciences. Both of these will work for assembly. For shotgun metagenomics, people tend to do short read, in part because it's still cheaper; also, you can get a lot more reads for the money, so you can get higher coverage. The other thing about short-read sequencing is that you have higher accuracy, and that can be really important for assembly. Short read is still the most commonly used method, and that's all we're going to talk about today. I also want to make the point that the factors that can reduce assembly quality include anything relating to poor read quality: base-calling errors, Ns, adapters you haven't dealt with through your QC. And I mentioned that your library makes a difference, so if your fragments are too small relative to your reads, your reads end up overlapping a lot, or reading through. That's not good for assembly; there are ways to deal with it, but ideally what you want is for your fragments to be longer than the two reads. The two paired reads are in opposition to one another, so that way they don't overlap; there's space in the DNA fragment between the forward and reverse reads. You'll see why in a minute. And you don't want your reads to be too short, which is in part why most people nowadays hardly ever do anything less than 100 base pairs. As I mentioned, you always have to think about coverage, and it's hard to perfectly predict. You can try to predict it and design your experiments accordingly, but if something ends up going weird, say your sample has way more taxa in it than you thought, then even with low coverage of those individual taxa, assembly will break and you'll have a more fragmented result. And it is important that you match the assembly algorithm with the type of reads.
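The fragment-versus-read-length point boils down to simple geometry: paired reads start from opposite ends of the fragment and read inward, so they collide whenever the fragment is shorter than two read lengths. A tiny sketch (the function name is my own):

```python
def reads_overlap(fragment_len, read_len):
    """Paired-end reads sequence inward from each end of a DNA fragment;
    they overlap in the middle when the fragment is shorter than two reads."""
    return fragment_len < 2 * read_len

print(reads_overlap(250, 150))  # True: 150 + 150 > 250, the reads overlap
print(reads_overlap(400, 150))  # False: 100 bp of unread insert sits between them
```

That unread middle stretch is the "space in the DNA fragment" mentioned above, and it's what makes read pairs informative for scaffolding later on.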
Some algorithms are for short reads, some are for long reads, and there are some that bring both together. When you do bring both together, you can improve the contiguity of your assembled genome. But again, there are challenges with the lower accuracy of long reads, with getting good coverage, and with the throughput for the longer reads. So that's less common in shotgun metagenomic assembly, and I'll show you in a minute why it's still sometimes used. Okay. So, assembly in a nutshell. It is de novo, so it's driven purely by the reads, without a reference. You can think of it simply as the overlap-consensus model: you have all these reads, you figure out where they overlap, and then that overlapping fragment of DNA becomes your contig, your consensus sequence. In reality, all shotgun metagenomic assemblers use de Bruijn graphs, a slightly different method, but conceptually, for the purposes of this course and this diagram, you can think of it that way. There are some links I'll tell you about later with lots of videos and tutorials about all these different assembly algorithms that you can look at. So you take your reads, you do this overlapping consensus, and you get the longer pieces that they can be assembled into; we call those contigs. We go a step further and link contigs together into scaffolds, and the way we do this (it's usually within the same assembly program or package) is to map the reads back to those assembled contigs. When you have a read pair where the forward read maps to one contig and the reverse read maps to another contig, you can infer that maybe those came from a fragment that just didn't assemble well in between. So it helps you order the contigs and figure out how they relate to one another.
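Real shotgun assemblers use de Bruijn graphs, but the overlap-consensus idea can be sketched in a few lines of Python. This is purely a toy illustration under my own simplifications (greedy merging, made-up function names and example reads), not how any production assembler works:

```python
def merge(a, b, min_overlap=3):
    """Merge read b onto read a using the longest suffix(a)/prefix(b) overlap,
    or return None if no overlap of at least min_overlap exists."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None

def greedy_assemble(reads):
    """Repeatedly merge any overlapping pair of sequences until no merge is possible.
    The surviving sequences are the 'contigs' of this toy assembler."""
    contigs = list(reads)
    merged = True
    while merged and len(contigs) > 1:
        merged = False
        for i in range(len(contigs)):
            for j in range(len(contigs)):
                if i != j:
                    m = merge(contigs[i], contigs[j])
                    if m:
                        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [m]
                        merged = True
                        break
            if merged:
                break
    return contigs

# Three short reads that overlap pairwise assemble into one contig:
print(greedy_assemble(["ATGGCGT", "GCGTACGA", "ACGATTT"]))  # ['ATGGCGTACGATTT']
```

This also hints at why coverage matters: if the read spanning two neighbors is missing, no overlap exists and the assembly stays fragmented.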
There will be a few more pictures here that illustrate this. You can picture the fragmented DNA that you made into your library: you've got the adapters on the ends, you have your forward read, your reverse read, and you have this insert size, the distance between them. Here are two contigs: this read maps to contig A, and this read maps to contig B, so you can infer that these would go together in this orientation. And what the assembler does when it makes a scaffold is insert Ns between the two. It adds Ns into your assembled genome between these contigs because it doesn't know exactly how much space is in there. So there is a little bit of guesswork that goes into scaffolding, but in general it still helps you bring things together and make your assembly more contiguous. The preview, then, for your tutorial is that you're going to be doing assembly with metaSPAdes. There are a few different algorithms used for metagenomic assembly, and metaSPAdes is one of the most popular; another really popular one is MEGAHIT. So you'll get a chance to try this out in your tutorial, or at least see the outputs, because this is very computationally intensive. Now, there are lots of ways to scaffold. The way I just described, where you map your short reads back to the assembly, is the most common, but there are others, and this is where long reads are sometimes brought into the process to improve the scaffolding of a metagenome. Another thing long reads help with is bridging repeat sequences, enabling you to actually assemble them. Repeats break assemblers, but if you have a long read that spans the whole repeat, you can assemble it. The other thing: remember I said there's lower accuracy, a lot of erroneous base calls, in long-read sequencing, which is why, if you're going to use it for de novo assembly, you really have to get a lot of coverage, and that's expensive and hard to do.
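The N-padding step can be sketched directly. A minimal toy (function name and the crude gap estimate are my assumptions; real scaffolders estimate the gap from the insert-size distribution of many read pairs):

```python
def scaffold(contig_a, contig_b, insert_size, read_len):
    """Join two contigs linked by a read pair, padding the unknown gap with Ns.

    Crude gap estimate: the insert size minus the two read lengths anchored
    at the facing contig ends; clamped to at least one N as a spacer.
    """
    gap = max(insert_size - 2 * read_len, 1)
    return contig_a + "N" * gap + contig_b

s = scaffold("ATGCCGTA", "GGATTACA", insert_size=20, read_len=8)
print(s)  # ATGCCGTANNNNGGATTACA
```

The Ns are exactly the guesswork mentioned above: the scaffold records that the contigs are adjacent and oriented, without claiming to know the intervening sequence.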
But if you're just using long reads for scaffolding, you can take your highly accurate short reads, make your contigs out of those, and use your long reads to figure out the ordering. Then the errors in the long reads don't matter that much; the erroneous base calls don't matter, because the long reads are just helping you figure out what sequence goes between contigs and what order and orientation the contigs go in. So that's where it can be quite helpful to have long reads in addition to short reads: for scaffolding. And just FYI, these technologies are also continuing to evolve, so that now we have four-megabase reads obtained on the Oxford Nanopore MinION, exceeding the typical length of a bacterial genome. That's like a whole genome in one read. It's going to have a lot of errors in it, but it's still pretty cool. Okay, so you've assembled your metagenome. Now what? You have to check the quality: how did it go? For that we use a tool called QUAST. There's also a version called metaQUAST that does some special things; we're not teaching it in the course, but you can be aware of it. QUAST will tell you the number of contigs and scaffolds that you've got, the total assembled length, and the N50, which is a measure of contiguity. It's going to be a number, and you can think of it as: 50% of the nucleotides in your assembly belong to contigs or scaffolds that are at least that length, so that long or longer. If you use QUAST and give it a reference genome (or, if you're doing metagenomics, this is where you have to use metaQUAST, and you can give it multiple reference genomes), then it can tell you about misassemblies. But because it's kind of a pain to always try to anticipate or figure out what's in your sample and find all those reference genomes, we usually skip this step. It's something you can look into if you really want to know whether there are misassemblies and where you want to improve.
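The N50 definition above translates directly into code. Here's a small reference implementation (a sketch; the function name is mine) that you can use to sanity-check the numbers QUAST reports:

```python
def n50(lengths):
    """N50: the contig length L such that contigs of length >= L together
    contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length

# Five contigs totaling 1500 bp; the 500 and 400 bp contigs together
# pass the 750 bp halfway mark, so N50 is 400.
print(n50([100, 200, 300, 400, 500]))  # 400
```

Note that N50 rewards contiguity, not correctness: an aggressively misassembled metagenome can have a great N50, which is why the misassembly checks mentioned above are a separate question.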
So you'll get to look at QUAST, as well as a different way of looking at these same metrics through a package called SeqKit, during your tutorial. This can be used just to evaluate what happened with your assembly; you can use it to compare different assembly algorithms and see which is performing best with your data; you can tweak assembly parameters. If you have some length filtering that goes into the subsequent steps, you can also tweak that and look at how it affected your overall assembled metagenome using QUAST. So it has a lot of really important uses; we tend to run it a lot as we're iteratively improving our assemblies. Okay, so we have now gotten through the first part: assembly, evaluating how it went, and filtering, which I'm not going to talk about because you'll go through it quite a bit in your tutorial. What can you do now? I just want to point out real quick that while we're going to focus the rest of the lecture on binning, on making MAGs, there's still a lot you can do with an assembled metagenome. You can map reads to it and look at whether you have changes in coverage in certain regions, suggesting that maybe certain strains, or a subset of the population, have lost that part of the genome. You can also look at SNP calling and do allele analysis just with assembled metagenomic contigs and reads, without making MAGs; you don't always have to make MAGs. And really, the most advanced work these days kind of integrates all these things together. The reality is that sometimes you're analyzing your assembled contigs from a metagenome, sometimes you're analyzing your MAGs, and sometimes you're doing short-read metagenomics, and you're integrating all of it together to capitalize on the strengths and unique attributes of these different techniques. So don't think that there's a one-size-fits-all solution. Okay, but MAGs have special uses, and we're going to get you to the point where you understand that.
So how do we make them? A MAG: a metagenome-assembled genome. Thank God they made an acronym, because that's hard to say. We just refer to them as MAGs, and sometimes they're referred to as bins, and for the purpose of this workshop I decided to make a distinction between those two for clarity. They're used interchangeably in the field, which is kind of annoying. For the purpose of the workshop, I'm going to call it a bin when it's just been spit out by a binning algorithm. Once we've checked it out, we're happy with it, we think it's good quality, and we want to move it into annotation, that's when I'm going to call it a MAG, because then I'm going to treat it like a genome. Sometimes your bins are sort of useless; they're not good enough quality to go forward with. To me, that's not a genome. But that's just my personal way of thinking about it; in reality, when you read papers, you'll see these used interchangeably, and there's no hard definition line between them. You can think of how we get these as solving a puzzle. Puzzle pieces, right? They have surfaces, they have pictures or colors on those surfaces, they have shapes, and we use this information to sort our puzzle pieces, figure out where they go, and fit them together to make a picture. You can think of the pieces as your DNA scaffolds, and the strategy for assembling your puzzle is binning. On the surface of the piece, we are looking for patterns in our pieces, our scaffolds, and the patterns that we're looking for are tetranucleotide content and GC content. Those patterns tell us about phylogeny; they tell us what maybe goes together in one genome. The other piece of information we look at is abundance. When we map the reads back to the assembled contigs, we look at how many of them stack up: how many layers of blankets stack up on that contig.
That's coverage. If you had an abundant organism in your sample, you're going to have high coverage of it. If you had a low-abundance organism in your sample, you're going to have lower coverage. Remember that genome size factors into that as well, but in general you can link coverage to abundance. And so these are the two things that we put together to bin. These are the signals that tell us how to put together the puzzle of which scaffolds came from which genome. So we have tools for this. We're not teaching anvi'o; it's something my group is still just starting to use, but it's really, really powerful and well developed, with lots of tutorials on their website. As far as I understand, you still have to fully install it on an Ubuntu system and have computational resources to run it. Once you run it, you do get some nice HTML and web-based visualization options, but from my understanding it's not really accessible for non-coders yet. The two that we're going to work with in the class are MaxBin2 and MetaBAT2; they're very commonly used by folks in their pipelines. So those are the two you'll work with today in the tutorial. Okay, so I'm going to walk you through this little diagram; it's like a little animation of this whole process, just to try to get this all to gel one more time in your brain. You can go to this link (I got this off the web, so thank you to them for making it) and watch it later. It's just kind of a fun little animation to remind you of what we have. So we have our mixed community, and we extract the DNA, and in that there are chromosomes from each of these different organisms. We're going to do shotgun metagenomic sequencing, so we're going to make reads. Different read pairs are going to come from each of these genomes.
But the thing is, we don't know which came from which, right? That's the whole problem of metagenomics; they're all grayed out. So to figure that out, we're going to assemble them into our scaffolds. Then we're going to generate this data; we're going to look at the information in the puzzle pieces, the coverage. To do that, we map the reads to the contigs and look at the coverage, how many layers of blanket stack up. This gives you a couple of numbers on those chromosomes, and it's saying that chromosome five, that organism, was more abundant than the one you see in blue, chromosome two, so that's giving you a hint. Then you look at the sequence characteristics: here's a tetranucleotide table, where you're basically enumerating the frequency of each tetranucleotide sequence within your contig or scaffold. Okay, so then you have these scaffolds and you don't really know which one is which, but through the magic of this coverage and the tetranucleotide characteristics, you can figure out which ones go together. So you can get all the scaffolds that came from chromosome, or organism, five together, and all the scaffolds that went with chromosome, organism, two together, and you can then treat these essentially as draft genomes for those organisms. So this is the process. There are some additional steps that often happen after the binning these days: refinement steps. There are a number of tools and some general approaches. Basically, people will run binning multiple times with multiple algorithms, and then iteratively aggregate and deduplicate bins to try to figure out which combination of bins, when you put them together, makes the best final MAG, if you will. This takes into account the quality metrics.
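That aggregate-and-deduplicate step usually boils down to scoring competing versions of the same bin from different binners and keeping the best one. A common heuristic (the contamination weight of 5 is a convention used by several refinement pipelines, not a universal standard) looks like this:

```python
def bin_score(completeness, contamination, weight=5.0):
    """Reward completeness, penalize contamination more heavily.
    Both inputs are percentages (0-100)."""
    return completeness - weight * contamination

def pick_best_versions(candidates):
    """candidates: {bin_name: [(completeness, contamination), ...]}
    with one tuple per binner/refiner that produced a version of
    this bin. Keep the highest-scoring version of each."""
    return {name: max(versions, key=lambda v: bin_score(*v))
            for name, versions in candidates.items()}
```

Note how a 95%-complete but 10%-contaminated version loses to a 90%-complete, 4%-contaminated one: the penalty on contamination dominates.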
I'll talk a little more about those in a second: the quality metrics that describe not the assembly, so not QUAST, but the bin or the MAG. Those go into this process, and then sometimes people reassemble: they take just the scaffolds that went into a bin, map the reads to those, subset those reads out, and reassemble just those reads together. That process is an option in refinement. The one you're going to try out today is the program MetaWRAP, which integrates the Binning_refiner tool as well as some of the quality metrics, and kind of does its own thing. If you're wondering at this point, when there are all these different ways to put together and refine these MAGs, how do you choose? When you look in the literature at a new tool, you'll often see that they've benchmarked it; they've studied how well the tool performs. Sometimes, though, they're promoting their new tool, so they're a little biased, but there can be really good data. This is from the paper that released MetaWRAP. It's actually a wrapper: a wrapper brings together existing tools and makes it easier to use and integrate them. Often when people do that, they also implement some of their own strategies and link things together in their own unique ways, and that's what MetaWRAP does. You can actually use MetaWRAP to do the whole process I just showed you in the animation, but in your tutorial you're just using it for refinement today. You can look at these papers and find data like this, where they've compared what happens when you use different binners and what happens when you use different refiners. There's also this effort, the Critical Assessment of Metagenome Interpretation, or CAMI. This is an organization.
They've come up with synthetic data sets and test data sets, and they've had competitions where people try to figure out what's best. This is the MetaWRAP paper using CAMI data, which means you know exactly what genomes went into that metagenome, and then you look to see how well the binners get those back out of the synthetic metagenome. So you can go into papers and study how a tool performs with CAMI data and how it performs compared to other tools. I'm not going to walk you through these data, but these are just some of the data from the MetaWRAP paper, where they basically show that you get more high-quality bins when you do refinement. You can appreciate that there are more bars, and they extend further to the right, when you do refinement than when you do binning alone, across these different metagenomes from different sample types. That's the kind of data we use to decide what to use, but it's never a final answer; we have to let time roll on and let all these different studies evaluate the tools. And because I like pictures, I pulled a few out of this paper. These are fun pictures that are sometimes made in binning studies. Each little dot is a scaffold, and they've arrayed the dots based on abundance and GC content; the colors here are high-level phyla, basically from a profiler. You can see there are some differences in GC content and some differences in abundance among these phyla, but it's still a little nebulous. This is the one where they've colored each dot based on bin. You can see much more discrete coloring; the bins really have separated things based on GC content and abundance, which is what we expect. It's a fun little visualization of it. Okay.
So, we talked about assessing our assembly quality with QUAST. This is how we assess our bin quality, and this is one area in metagenomics and MAGs where everybody uses the exact same tool: CheckM is the go-to tool for assessing quality. Its goal is to define completeness and contamination using lineage-specific single-copy genes. A single-copy gene is a gene that's found in a genome, and only one time. These tend to be housekeeping genes; they tend to be things that are really important. In order to do this, they've come up with some neat tricks to make it more accurate, which is why everybody uses this tool. First, there's a process embedded in the program that takes your new genome or bin, and you can use this with isolate genomes as well; in fact, it's starting to become standard that isolates are also evaluated using this tool. So you take your genome or bin, and you place it on a tree using universal markers, genes that are in essentially every microorganism, archaea or bacteria. You put it on the tree, and based on that, you define a lineage-specific set of single-copy marker genes. It turns out that when you hone this list of single-copy genes by lineage, it's more accurate for assessing completeness and contamination. They also account for collocation, because a lot of these single-copy genes that are essential for certain pathways are found together in an operon, so they're not independently telling you anything about whether your genome is complete or contaminated; they account for that, which is pretty cool. And then it's simply a kind of counting. How many of these lineage-specific single-copy marker genes are there? The number that are present tells you how complete your genome is. And how many are duplicated? You're not supposed to have two, we know that. So if you have two, or three, or four, that tells you how contaminated your genome is.
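The counting logic at the heart of that completeness and contamination estimate can be sketched in a few lines. This is a simplified illustration of the CheckM idea only; it ignores the collocated marker-set correction described above:

```python
def completeness_contamination(marker_counts):
    """marker_counts: {marker_gene: observed copies} for a
    lineage-specific single-copy marker set.
    Completeness  = % of markers seen at least once.
    Contamination = extra copies beyond the expected single copy,
    as a % of the marker set size."""
    n = len(marker_counts)
    present = sum(1 for c in marker_counts.values() if c >= 1)
    extra = sum(max(c - 1, 0) for c in marker_counts.values())
    return 100.0 * present / n, 100.0 * extra / n
```

So a bin where 3 of 4 markers are found, one of them twice, would come out as 75% complete and 25% contaminated under this toy version.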
This binning is not perfect. Sometimes you're going to pull in a contig that's wrong and end up with a slightly chimeric genome; sometimes you're going to leave something out that you should have brought in. None of this is perfect, and this tells you how it went. Another cool thing about this tool is that it distinguishes between strain heterogeneity and species-level contamination. It does that by looking at amino acid identity, which is kind of cool. They look at pairwise combinations, and if the amino acid identity is greater than 90%, they consider that strain variance; if the amino acid identity is less than 90%, they consider it species contamination. So this is what your output looks like, and you'll get to explore it in the tutorial. You'll get a table. It'll tell you the marker lineage. This is not the identity of your genome; it's just where it fit on the tree. It actually goes in and looks right at a node, so just because two rows show the same marker lineage doesn't necessarily mean they were evaluated against the exact same set of marker genes. Sometimes they were, but not always. Then it reports how many markers it found. The marker sets are where they've controlled for that operon-structure bit. It'll tell you how many of those markers were missing, how many were present exactly once, and how many were present two, three, four, or five-plus times. Then it aggregates all that information into your percent completeness and your percent contamination, which is really convenient. You can ask it to deliver these little graphics for you, and here, every little dot in the side-by-side bars colorimetrically shows you a marker gene in that bin. It's green if it's there in a single copy: good. It's gray if it's not there: bad. And then your different levels of contamination are shown in the blue and orange colors.
It's widely used; it's in pretty much every paper. You can also ask this tool to report QUAST-style statistics, and it'll tell you other things like the length of your bin, the number of contigs, N50, the standard kinds of genome metrics; you can also do that separately in QUAST yourself. In the CheckM paper, they put together some language for how you can use words to describe where your genome falls. Later they redefined that, made it a little less granular, and came up with quality definitions for MAGs called the MIMAG criteria; this is the paper that lays that out. So if you want to know how high in quality your MAGs are, you can look at these resources. When you go to upload MAGs to the SRA, so if you're going to put them into NCBI, they basically require you to use these, and I'm not sure whether the European databases require the same, but pretty much we're trying to get everybody to stick to the same quality descriptions, the MIMAG criteria. You'll get to try out CheckM in your tutorial. And like I said earlier, if you're happy with the parameters you've chosen to this point, happy with how the assembly went, happy with how your binning and refinement went, and you have some bins that are at least 50% complete and no more than 10% contaminated, you're good to go. You've got some bins you can carry forward and study just like you would study a bacterial genome; you can call those MAGs, in my book. Sometimes, for certain questions, 50% isn't cutting it and we need more complete MAGs, so what level of completeness is good enough for the next steps varies with your question. Contamination is the same: some people are fine with 10%, some people say that's too much and only want to look at bins with 5% or less. Okay, I'm done with that; we're two-thirds of the way there. How it started: I think this is again a good opportunity to realize what an impact this has had on the field of microbial genomics.
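The 50%-complete, under-10%-contaminated rule of thumb maps onto the MIMAG tiers roughly as follows. This is a sketch: real MIMAG "high quality" additionally requires the rRNA genes and at least 18 tRNAs, which this function does not check:

```python
def mimag_tier(completeness, contamination):
    """Rough MIMAG-style quality tier from completeness and
    contamination percentages alone (rRNA/tRNA checks omitted)."""
    if completeness > 90 and contamination < 5:
        return "high quality (pending rRNA/tRNA checks)"
    if completeness >= 50 and contamination < 10:
        return "medium quality"
    return "low quality"
```

Medium quality is the usual floor for carrying a bin forward as a MAG, but as noted above, many questions demand more complete genomes than that.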
How it started, circa 2017: "Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life." And how it's going now: hundreds of thousands of MAGs in single papers. I think it would be fun, and I don't know if you're all on the Slack site, but for those of you in different fields, see if you can figure out which papers in your field have the most MAGs and post them in the Slack, because I don't have a chance to go look at what's happening in the oceans or in soil; these examples are human. The second one is actually a blend of MAGs and reference genomes, like cultured isolate genomes. But yeah, on the order of 100,000 to 150,000 is kind of standard now for what we see when a study pulls everything out of the databases to get to these numbers. On the second paper, the one that unified the reference genomes: we've come a long way from the HMP's roughly 3,000, right? This paper showed it neatly: in green over here are MAGs, and in blue are cultured isolates, and this is a log scale. So in every single database, the MAGs outnumber the isolate genomes by two to three orders of magnitude; there are just so many more, factors of hundreds more. We can also see where they come from in the world: we have a lot of MAGs from North America, various Pacific countries, Europe, and China, but we're not yet getting all that genomic information from a lot of other places. Then, interestingly, and this is kind of shocking and frightening in some ways: if you look at the species accumulation curve here, we're not saturating. The thing is, if we take out singletons, species we've only seen once, in one person, then we start to saturate.
Do we all have unique species? Seriously, is that what's happening here, is that where we're heading? We're still trying to figure that out. So there are some variations on the theme. This first one is very common in the field; it's not new. A lot of times in this science we do serial sampling, and the idea is that you can sample the same system, or the same person, multiple times, then make multiple assembled metagenomes from those samples and pool them together, or you can co-assemble them. The idea is that you can get more reads for certain taxa and maybe hit assembly sweet spots with resampling. This has been a trick in use for a while; certain fields may still use it a lot. I don't actually use it, but it's something you'll see in this field: serial sampling with co-aggregation of bins, or co-assembly of metagenomes. The other thing I think is cool is Hi-C contact maps. This is a different way to scaffold or bin; you can use it for both. The idea is that you make two libraries: you do standard shotgun sequencing and assemble that, then you make a second library where you use molecular biology tricks to join pieces of DNA that aren't contiguous. You get a little fragment, say 500 nucleotides long, made of pieces of DNA that were in the same single cell but are not contiguous. This means you can link a chromosome in a cell with its plasmid, or a chromosome in a cell with the phage that's infecting that cell and replicating. The crosslinking is confined to within a cell, but you don't need contiguity of the DNA that goes into this Hi-C pair library. So this gives you a different strategy for binning, and where it's really shining in the literature right now is linking plasmids to chromosomes and phage to host.
And now, with long reads, if you integrate Hi-C with long-read metagenomics, you can fully resolve and close strain-level genomes from a metagenome. This is the future too, right? We can get to the alleles, we can get to real resolution, and we can close genomes; we can know exactly what's in a genome. So we can do all of this stuff. CheckM, of course, is the most popular tool for quality. CheckM2 is in preprint; it uses machine learning and is implemented in a very different way, doing very different things but to the same end. It's still in preprint, so I think it's still being evaluated, and we haven't tried it yet. There's also CheckV out there, and people are using that for viral genomes, so if you're interested in viruses, check that out. Okay, seven minutes to get through the rest. So, you have these beautiful MAGs, and you want to do stuff with them. You want to know the name of the organism; that's pretty standard, right? The way this has become standardized now is to use the tool GTDB-Tk, which uses a particular way of defining taxonomy, the GTDB, which we talked about a little earlier in the course. I took this quote from Laura Hug, who has taught this lecture for many years in the past, and she just said it so well; this is exactly the way I feel as well. You can choose to use GTDB if you want, but it is a little weird: sometimes they rename things, and sometimes they do it in a weird way, though sometimes it's very systematic and understandable. But if your field is used to calling organisms by certain names, those organisms may not show up under those names, and it takes work to cross-link things together. Still, it's just very fast and easy to use. It's basically a nucleotide-identity-based system of defining a species: if your genome has 95% average nucleotide identity to another genome, it's basically in that same species, and for certain clades they bumped that up to 97%.
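That ANI-based species rule can be written as a tiny decision function, just to make the thresholds concrete. This is a simplification of GTDB-Tk's actual placement logic, and the alignment-fraction check shown here reflects the additional requirement that enough of the genome aligns before a species call is made:

```python
def assign_species(ani, aligned_fraction,
                   ani_cutoff=95.0, af_cutoff=0.5):
    """Same-species call under a GTDB-style rule of thumb:
    ANI >= ~95% (raised toward 97% for some clades, via ani_cutoff)
    AND at least ~50% of the genome aligned to the reference."""
    return ani >= ani_cutoff and aligned_fraction >= af_cutoff
```

So a genome at 96% ANI but only 30% aligned fraction would not be assigned to the species, which guards against calls driven by a small shared region.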
And then they also make sure that your genome aligns over at least 50% of its length to that species before they classify it. You're going to try this out in the tutorial and see how it works and what the outputs look like. When it comes to annotating, wanting to understand what MAGs do and their functions, this is again where you have to think through some things. You're basically going to take this as a genome and do the same things you would do to a genome: call all your open reading frames, annotate your genes, and then look for what I call structural elements. You may want to know about things like transposons, pathogenicity islands, places where horizontal gene transfer has happened, or phage insertions; I call those structural elements. They're things in the genome that aren't really the same as a function or a single gene. So, what are you looking for: metabolic pathways, functional groups, certain types of proteins, novel proteins, or these structural elements? What you're looking for dictates the tools you use downstream, and whether it's going to be a one-size-fits-all pipeline or something really specialized, a tool that goes looking in your MAGs for exactly what you want. You're going to explore and experiment a little with some tools that are more specialized and some that are more generalized in your workshop. I also like to think about the scale. It used to be that assembly was 100% the most computationally intensive part of this. That's maybe still true, but now that we're dealing with hundreds of MAGs, annotating all of them is also very computationally intensive. Now that we have so much information and so many choices in our databases, just unpacking the databases can take more room than assembling a huge metagenome. So you can end up a little pigeonholed in what you can do downstream based on the scale of your project and the resources you have available.
And some of the databases for functional annotation are licensed, which can also be restricting in some cases. I'm not going to go through the details of any of these tools; they're all complicated and they all do a lot of things, but I'll show you some of the common ones. RAST and eggNOG have a cool feature: online portals. You can upload your assembled scaffolds and your bins or MAGs to these programs, and they'll annotate them and either give you a web portal to look at what they spit out or email you tables. So there are ways to do this through the web using these tools, but even just what eggNOG alone does is a lot. Prokka and DRAM are the two main ones we put into the workshop because they're very commonly used. Prokka is pretty quick and doesn't use too many databases, and you pretty much have to use it at the command line. DRAM, which I think you also pretty much have to use at the command line, uses a lot more databases, but it does a lot of neat things: it gives you more tools for looking at metabolic pathways and modeling, looking at completeness, and really asking what the aggregate function of this genome is, which is tough to do, but it helps with that. I like DRAM because we do a lot of work with CAZymes and proteases, and it specifically goes out and looks for those, where most other tools don't. And if you're into viruses, it also has some cool tools for annotating viruses. So in the tutorial you'll get a chance, and you may not get through all of it in class, but we wanted you to have the opportunity to understand how many options there are for these last downstream parts, to try them out and compare them. You'll be comparing these three for functional annotation. And then I just want to close with a few fun examples of what you get out of all of this.
The first two examples come from my lab, and I'll go through them quickly just to give you some context of what you get out of all of this, and how sometimes it's still very challenging. First: you can find novel clades. Absolutely. This organism, Fannyhessea, is a really prominent vaginal microbe. It used to be called Atopobium vaginae; we knew Atopobium vaginae as one species, and we knew it was important, right? Well, I started doing metagenomics on vaginal samples and, lo and behold, I started to get all of these different colors here: you can see the blues are here, the greens are here, and the reds are here. This is a pangenome map, and when you see that kind of pattern, it means they are not the same species. You put together these pangenome maps using identity thresholds, and this is a very low identity threshold, 73%. So basically, all these green and blue ones were being called Fannyhessea, or Atopobium vaginae, but they clearly were not all the same species; it was very confusing for a time. Then I had these pink ones in my MAGs that I couldn't identify. We went on and eventually showed that indeed there are three species in this genus, and we never knew it. The reason we never knew is that the 16S identity is over 98.4% at full length; we didn't know these were different species until we had the MAGs. So then we went back to the same samples the MAGs came from, cultured this thing, and now we're writing up a paper formally naming it. So this definitely helps you figure out and discover whole new bacteria. One little note: when you run GTDB-Tk and you have a novel species, this is what you get: an "s__" with nothing after it. No bells and whistles; you've got something novel, and it's just a blank species field. If you had a blank "g__", you'd have a novel genus.
Although I think a lot of the time it actually makes up a genus name; but if it can't figure out the species, if it's truly a novel species, you get the blank "s__". It's cool. Okay, I'm going to go over by five minutes, I think, and then we'll be done. So, in the vaginal microbiome space, like I mentioned, in adolescents something really unique happens: the tissue changes as hormones rise, right? Puberty happens. What happens in the vaginal epithelium is that it thickens and glycogen gets deposited, and we've known since the 1920s that that's when lactobacilli colonize. So we always say the human microbiome forms and is developed by early childhood. No, not all body sites, not in all situations. Some body sites do not get colonized and develop until later; they don't need to until later, and hormones change things. So we've known since the 1920s that these lactobacilli, these big fat purple Gram-positive rods, colonize the vagina when this happens, and it's always been thought that the glycogen feeds them: the tissue thickens, the glycogen is deposited by the host, and it drives the microbes to colonize by feeding them. We've never had proof of that, and we've had trouble proving it, because if we culture these bugs, they don't grow on glycogen. Well, that's weird: why is there glycogen in vivo, and why do we think this, yeah? It's a big conundrum in the field, so for a long time people thought that host amylase was required to break down that glycogen into maltodextrins, and the bugs would eat the maltodextrins. It wasn't until 2019, with all these genomes, that we finally realized: oh yeah, some of these vaginal bugs do have a pullulanase, and a pullulanase can debranch glycogen. Some of these bugs can eat glycogen. I don't even know for sure why it took us so long to figure that out, but that's microbiology for you.
So we wanted to ask: do they all do this in vivo? Do they all eat glycogen in vivo? Because what that paper showed is that only certain cultured isolates encoded the pullulanase and could break down the glycogen; could they all do it in vivo? So we sequenced. We did a pilot, and the samples you're getting in your tutorial came from this pilot, which is why I'm talking about it. We handled them with shotgun metagenomics and ran MetaPhlAn, which looks at composition, and this is what we got. These yellow ones here: this is Lactobacillus crispatus, and what people are showing now from all over the world is that this tends to be the first colonizer, we think, in girls at puberty. We don't have a lot of data to say it's always the one that colonizes, but it's always the most common around the world. In this cohort of African girls, and these were older girls, around 17 most of them, 65% still had Lactobacillus dominating, lots of L. crispatus. And this is the one that makes the pullulanase; it's really one of the few that does. So we wanted to ask whether they all have a functional pullulanase, and we wanted to use metagenomics to do that. So we assembled, we made MAGs, and we also looked at the assembled contigs, and we did this with cultured isolates of L. crispatus alongside these metagenomes. And we didn't just take metagenomes from our Kenyan girls, our 17 pilot samples; we also took metagenomes from other published databases. What we found is that there were very high rates of functional inactivation: 25 to 30% have a mutation in the allele we assembled, or they have lost the gene. And when we look at how they're losing the gene or how they're getting a mutation, there seem to be a thousand different ways to mutate or lose this thing. Okay, so if it's so important, why is it getting mutated so often? It's kind of a conundrum.
We wanted to make sure that the metagenomics itself wasn't causing a problem here, and this is what's neat: we have these assemblies, and we have the reads, which we can map back to a closed reference genome. We see different coverages, but when we look at where the pullulanase should be, indeed there's a gap. We can use both the assemblies, where we can see part of the gene is missing, and the reads mapped back to a closed reference, where we can see a hole, confirming that it really seems to be lost. We actually did PCR too, and it was 100% congruent: we could amplify the pullulanase from the samples where we detected it with metagenomics, and we couldn't amplify it in the others. That was interesting, because we often think of these MAGs as being a consensus of different strains, but I wasn't able to amplify a sub-dominant strain carrying a pullulanase from the samples where it had seemingly been lost. So this is telling us that we really are understanding allelic biology with what we're getting from our MAGs. We also linked this to function, and I'll skip over it pretty quickly here, but we can link whether we see a detectable functional pullulanase with the enzymatic activity. Another benefit of MAGs is that we have the whole genome, so we can look at the biology of the genome and overlay that with whether certain genes are lost or mutated. I just want to show that in certain branches of the L. crispatus tree there was a higher incidence of mutation, and in certain branches a higher incidence of loss, but none of this was geographically linked. So there are some really interesting things you can do with metagenomics. I'm pretty much going to stop here; I have a few other papers in the slides on genetic exchange, and if you want to discuss them while we're walking around doing the tutorial, feel free to discuss them with me or read them on your own.
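Detecting that kind of gene loss from mapped reads often comes down to spotting a long zero-coverage window over the locus. Here is a minimal sketch of that idea; the window length is an arbitrary illustration, not a value used in the study:

```python
def has_coverage_gap(depths, min_gap=200):
    """Flag a run of >= min_gap consecutive zero-depth positions
    in a per-base depth array over a gene locus. A long gap in an
    otherwise well-covered region suggests the locus is absent from
    the dominant strain rather than merely under-sequenced."""
    run = 0
    for d in depths:
        run = run + 1 if d == 0 else 0
        if run >= min_gap:
            return True
    return False
```

In practice you'd pull per-base depth from the read alignments (e.g. a samtools depth track) over the gene's coordinates in the closed reference, then apply a check like this.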
I just think it's important to realize that with horizontal gene transfer and genetic exchange, we've always thought it's not a good thing to study with MAGs, and this is one of the key papers that showed there are limitations; it's hard to bin those parts of genomes properly. But there are some really cool papers showing we can still learn a lot about mobile elements and what's going on with them, so you can explore these figures and papers a little more, and I'm happy to discuss them with you. And because of that, I thought it would be fun to add a tool at the end of your tutorial where you can explore predicting mobile elements. It's not necessarily the best tool for this, since it's a brand new one, but it was an accessible one; this is actually the process diagram from that other paper, and I didn't want to give you that, of course. Lastly, I'll close by saying that there are also lots of MAGs in databases, so you can do comparative genomics with MAGs through IMG. If you're not a coder and you want publicly available MAGs, they're in databases like this, and you can integrate them into comparative genomics through these databases. There's also a tool called KBase, with a Nature Protocols paper that came out last year, I think, and you can run all the things we're teaching in this course through a web portal in that tool. So even if you're not going to be a coder, you can still do a lot of this, just at a slightly slower pace. And here are your links to the sequencing texts. So thank you so much for your time; sorry I went a little over. Thank you.