So what I'm going to cover now is shotgun metagenomics and how we annotate it so that we get both taxonomic profiles and functional profiles. Okay. So just to do my one-slide comparison between 16S and shotgun metagenomics: people usually consider metagenomics better, and it is more expensive, so more expensive usually seems better, but there are actually other pros and cons beyond just getting more data. So with 16S, it's targeted sequencing of a single gene, and from that you basically get taxonomic profiles. It's really well established and relatively cheap, about $20 for about 50,000 sequences per sample right now. And another benefit that often doesn't get mentioned is that it only amplifies what you want. So depending on what you're doing, you might otherwise get host contamination, which can be a problem with metagenomics, right? So if you're working with, say, biopsy samples or maybe even skin samples, and you do shotgun metagenomics, which is basically the sequencing of all the DNA in the sample, then instead of a single gene you're getting the metagenome, all the DNA from that sample. And all the DNA means, yes, there will be a lot of microbes, but there can also be host DNA. And since our cells carry a lot more DNA than microbial cells, it doesn't take many host cells to contribute a lot of that contamination, right? So for example, I'm going to show you a little study we did where we really wanted to do shotgun metagenomics on biopsy samples. We sequenced them deeply, and 95% of our data was human, right? So you really have to deal with that host DNA somehow. But of course there are benefits to metagenomics as well. There's no primer bias: the issue we talked about, that you have to pick a variable region, goes away, because in theory with shotgun metagenomics there's no amplification bias in the DNA you get back. Another big benefit is that you're getting all the microbes, so if you want to go beyond bacteria you don't need a different primer set. With amplicon sequencing we often use ITS for fungi, or maybe archaea-specific 16S PCR primer sets, but with metagenomics you get a snapshot of all the microbes, including eukaryotes and viruses. And then obviously the other big thing is that you get a catalog of the reads that are contributing to genes, and you can annotate those genes with functions and then ask not just who is there, but also what they are doing, right? But both have their advantages and disadvantages. Okay. So just to give you a little taste of the science before I get into the technical details of how we annotate sequences, I wanted to highlight this study, one because it compares 16S and metagenomics, and also because it touches a little bit on how we're using random forest machine learning in our lab right now. So it's a bit of a teaser for the second half, where you'll be learning more about machine learning. So basically in this data set we're looking at pediatric Crohn's patients. These patients weren't treated yet, so they're treatment-naive, and we took biopsy samples, and as I just mentioned, we wanted to do shotgun metagenomics on them. So we have shotgun metagenomic sequencing, and we also did 16S sequencing on them.
And then what we could do, which is how the project actually started, is we took this 95% of the DNA that we were just going to throw away, and we actually called SNPs with it, which was kind of nice. It's a nice side benefit, so from one sample we now have SNPs called for human genetics, we have the shotgun metagenomics, and we have the 16S. And we basically collapsed most of these SNPs into something that had already been developed, a genetic risk score specific for IBD, since these variants had already been associated with disease. So from this risk score you basically get a single measure of your risk for developing Crohn's disease. And then from the 16S and the shotgun data you can get these tables, right? So we talked about OTU tables; we can collapse those into the different taxonomic ranks, and the same for the metagenomes, or you can classify the reads into functional categories. When this was done we used KEGG, and I'll talk about other functional databases in a couple of minutes. And with the OTUs we looked, again, at all the taxonomic ranks. So we looked at all these tables, we also looked at alpha diversity, and we also inferred functions from the 16S data. So we had all these little tables, and we asked: which is best for predicting who has the disease versus the healthy controls? And then we also had a follow-up question: we know that about half of these patients go into sustained remission and the others don't, and we wanted to figure out whether we could predict that from the baseline microbiome profile. I'm not going to go into machine learning in depth, because that's going to be covered really well later. But the idea is that we have these feature tables, which again could be OTU tables or functional tables, we have labels, and we're testing two different things: one, whether they have the disease, and two, their treatment outcome. And for this we're just doing leave-one-out validation. So the first question was classification of Crohn's disease versus control. Basically what this shows is, for each one of those tables I just mentioned, the accuracy that we get from the random forest method, and the stars just indicate significance. You see the 16S data, the MGS data, and in green the human genomic data. What you see right off is that even though we did all this metagenomic sequencing at quite a high cost, where 95% of the data was human, we actually didn't get significance at all, except for the genetic risk score from the human SNPs. But we did get about 84% accuracy at the genus level from the 16S data. So that was interesting. And then what else is nice about random forests, and some other machine learning methods, is that you can get the most important features for that prediction model. So this is just a ranking of those different features, and if we zoom in at the top, you get an indication of some of the genera that are either higher in Crohn's disease or lower in Crohn's disease, and many of them have been previously associated with it. Yeah? For the random forest, do you feed in the whole taxonomy? In this case we just tested each of these tables individually. Yeah, so we're testing each one of those tables individually, but then we did do the combination as well. Wouldn't the combination give about the same result? Yeah, absolutely, the accuracies are obviously quite related.
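Stepping back to the method for a second, here is a minimal sketch of the leave-one-out random forest setup just described, using scikit-learn; the feature table, labels, and parameters below are made-up placeholders, not the actual analysis.

```python
# Minimal sketch (assumed, not the actual study code): random forest
# classification of a sample-by-feature table with leave-one-out
# cross-validation, plus feature importance ranking.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical inputs: rows = samples, columns = features
# (e.g. genus-level relative abundances); y = disease labels.
X = np.random.rand(40, 120)           # placeholder feature table
y = np.random.randint(0, 2, size=40)  # placeholder CD vs control labels

clf = RandomForestClassifier(n_estimators=500, random_state=42)

# Leave-one-out cross-validation accuracy.
acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(f"LOO accuracy: {acc:.2f}")

# Fit on all samples to rank features by importance.
clf.fit(X, y)
top = np.argsort(clf.feature_importances_)[::-1][:10]
print("Top feature indices:", top)
```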
And when we look at, say, the phylum-level features, which score a bit higher, they're obviously related to the same genus-level features, so we do see that they're quite correlated. We also tried this idea of just combining the data: instead of testing the tables separately, if we start to combine them, does that give us a better predictor? Not so much, right now. We're still testing it out, but we only gained maybe 1% or 2% in accuracy. What is nice, though, is that you can look at the relative contributions of the different features. So you see Akkermansia muciniphila, an Akkermansia species, is the most abundant feature, but then obviously that sits within this phylum, so those two are quite related. But then you see this particular KEGG ortholog group, which is a functional gene, and where it ranks, and you see the genetic risk score, the human genetic component, is one of the least important features. So it's kind of interesting to see how these features come together. Anyway, I don't want to spend too much time on the science, but the more interesting question for us was whether we could predict who was going to respond to the treatment they were receiving. And here the metagenomics starts to come into play a little bit more. The genus level is still slightly higher in the 16S data, but now we do see the class and phylum levels from the metagenomic data being significant for determining who would go into sustained remission. And when we look at the most important features, what's even more interesting is that all of a sudden the things you wouldn't pick up from 16S data, like viruses or microbial eukaryotes, pop up among our interesting features. So we're using that as a way to pull out interesting taxa, and now we're starting to do more of the viral work to understand how this comes into play. Okay, any questions on this little science vignette before I move on to the technical details? Yep? The question is how well the 16S agrees with the metagenomics here. That's a good question. We didn't do a really rigorous test on that; other people have looked at it. If we look at them on, say, a PCoA plot, they do separate. I can't comment, from this study, on how well they correlate, because we didn't actually try to merge them, so I can't give you a correlation score, but other people have, and it's not super satisfying sometimes; they're not quite the same. Next question: when you're comparing the 16S and the metagenomes, what was the depth of sequencing for the metagenomes? We did about 100 million sequences per metagenome, and we were left with less than 5 million sequences after filtering out the human data, so about 5 million metagenomic sequences. What's the average sequencing depth per sample? It's the same answer: there are about 100 million sequences per sample, 95% of those map to the human genome, which left us about 5 million true microbial metagenome reads. Is that what you were asking? One more thing: you can calculate the sequence coverage.
The depth of coverage, like 20x or 30x, is the number of bases sequenced divided by the genome size, which you can quickly calculate from the sequences. I see, so coverage of the genomes within the metagenome, for the microbes? Yeah. We didn't calculate it, but it is just one simple calculation, and there will be more about that later. Okay, so to back up after that quick jump into the science: I'm going to start with the question of what you do first when you have metagenomic reads, which, just like with 16S, is to find out who is there, right? What taxa are there, and what are their relative abundances? And there are problems that make this difficult. The reads are obviously all mixed together; that's the fundamental problem: we don't know which reads came from which cell. The reads can be short, which is problematic as well; we're working with reads of around 100 to 150 base pairs, and before, people used to work with 50 base pairs. Lateral gene transfer, both ancient and more recent, can really mess up taxonomic profiles, because if a read comes from a gene that is in this organism but was transferred recently, or transferred a while ago, that makes taxonomic profiling hard. And metagenomic data sets are fairly large, so computational time can make things difficult if you're doing alignment-based searches against big databases. So, and this is always kind of hard to split up, there are two broad approaches. One is this idea of binning or assembly, where you're using all of your reads for the taxonomic annotation; Frédéric is going to talk more about methods from that angle. And then I'm going to talk more about getting taxonomic profiles from marker-based approaches, where you're using particular marker genes within the data set to figure out the taxonomic profile. So just as a one-slide overview before I look at this in more detail: within the binning-based approaches there are composition-based methods, where you're using k-mers or, say, GC percentage, and there are classifiers built on that; composition-based methods are fairly fast. You can also do similarity-based methods, where you're doing some sort of BLAST-like search against a large database, with common tools like Kraken, and then from that you're using the best hit or some sort of lowest common ancestor. And then assembly-based binning will be covered by Frédéric. Okay, so the other approach is the marker-based approach. In almost the simplest case, you could think of just taking the 16S reads from your metagenome and piping them through, say, QIIME the way we would do it normally. That's possible, but then you're using one single marker when you have all this data, so it seems quite wasteful. The other approach is to use several genes, picking genes that are well conserved and fairly universal. That approach was taken by PhyloSift, from Darling et al., 2014; they use 37 universal single-copy genes, so that's like taking 16S plus a whole bunch of others.
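Going back to the composition-based idea for a second, here is a small sketch of the kind of signal those methods use, GC content and a k-mer frequency profile; the read sequence is invented, and real classifiers compare such profiles against models built from reference genomes.

```python
# Sketch (assumed example): summarize a read by GC content and a
# normalized k-mer frequency profile, the raw material of
# composition-based classifiers.
from collections import Counter
from itertools import product

def gc_content(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies for one sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    # Report every possible k-mer so profiles are directly comparable.
    return {"".join(km): counts["".join(km)] / total
            for km in product("ACGT", repeat=k)}

read = "ATGCGGCCATTAGGCCATACGGTTAACCGGATATCG"  # made-up read
print(round(gc_content(read), 2))
print(sorted(kmer_profile(read).items(), key=lambda x: -x[1])[:5])
```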
The other marker-based approach is to look at clade-specific markers: rather than genes you see everywhere across the tree of life, these are genes that are taxonomically restricted, so they're only found in a certain subset of species. If you use them more like biomarkers, you can identify your taxa that way. That's the approach used by MetaPhlAn and MetaPhlAn2, and I'll go into a little more detail about MetaPhlAn2. So, to sum up the trade-offs: binning approaches tend to be more computation-intensive, because you're dealing with all the data, which means either assembling everything or doing similarity searches across all your reads, and varying genome sizes or lateral gene transfer can sometimes bias those results. Marker-based approaches tend to be faster, because you're only searching against a few markers, but the downside is that you don't actually get to bin the reads together; you can't do anything like genome reconstruction, where people are trying to pull genomes out of their metagenomes. And obviously with a marker-based approach you're really reliant on how good those markers are. Those are some of the downsides, but both are really relevant. Okay, so the tutorial I'm going to give uses MetaPhlAn2. I have nothing to do with MetaPhlAn2's development; I'm not a fanboy either. I haven't developed any of these methods, it's just what we use in our lab, and it satisfies our needs fairly well. So, disclaimer: I'm not pushing my own software at this point. Tomorrow I'll do that with PICRUSt, but not now. Okay, so why MetaPhlAn2? As I mentioned, it's relatively fast, because the database is much smaller when it only contains markers. It's nice that it has markers for bacteria, archaea, microbial eukaryotes, and viruses. It's fairly widely used and tested, which is not always the best reason for using a tool, but it does mean it's less likely to have bugs, I think. Quite a few tools have been compared to MetaPhlAn2, and it's always up there in the mix, right? When people come up with new methods and compare, theirs are usually better, because that's how you get published, but MetaPhlAn2 at least seems to be in the right area, just a bit below the newest ones. And as I mentioned before, the main disadvantage is that not all the reads are assigned a taxonomic label, which means you're heavily dependent on MetaPhlAn2 having markers for the things you care about, right? If 20% of your sample were dominated by something that had never been seen before, you wouldn't know, because MetaPhlAn2 doesn't even pull those reads out; you don't know what percentage of the community isn't represented. MetaPhlAn2 uses these clade-specific markers, and there are a lot of them, and the nice thing is that you can identify down to the species level and sometimes even the strain level, right? People talk about this a lot; it's another advantage I didn't mention in the 16S comparison. With 16S you'll get down to maybe genus reliably, sometimes species, but not really reliably. With metagenomics, in theory you can get down to the strain level, although in practice that's also pretty hard sometimes. But at least the data is there.
In theory you can get down there: you could have 100% identical 16S sequences that actually represent different strains, and there's just no more resolution there, but with metagenomics you can get at that. The other thing is that it's relatively fast, so you can run MetaPhlAn2 on your desktop computer. The whole idea with this marker-based approach is that you're looking at genes and how they're distributed across your taxonomic or phylogenetic tree. So this is just an example of a core gene that is clade-wide but not unique to that clade, right? It's also found in other places, so it's not a very good marker. What you're looking for instead is a gene that's unique to this clade and not found anywhere else. The thing your mind might jump to, and it probably should, is: what happens when there's a genome out there we've never sampled and the gene is found over there too? The way they get around that is that they don't depend on just a single marker to call an organism; they use multiple markers, to hopefully make the prediction more robust. The selection of these markers is all done offline, in this thing they call ChocoPhlAn. And then when you actually get your reads, you're basically just taking your reads, unassembled, although in theory you could assemble them first, and searching them, not with BLAST but usually with bowtie2 as a nucleotide search, against their database of markers, and out of that you get relative abundances. I think I just said everything that's written on the slide, perfect. Okay, so that is MetaPhlAn2, and next we're going to talk about functional composition. Any questions on the taxonomic assignment idea? Yeah, a quick question about StrainPhlAn. It's a bit confusing, they have a lot of PhlAns, but my understanding is that it's actually implemented within the MetaPhlAn package; they just wanted to call it something different when they came up with it. So essentially you get strain identification using StrainPhlAn; that's my understanding of the main difference. Any other questions? Yep. The comment is that the accuracy of MetaPhlAn may vary for different environments, because MetaPhlAn was mostly developed with the human microbiome, so for other environments the accuracy can be different because of the reference database. Yeah, to be fair, I have seen them evaluate accuracy against other data sets, not just human, but it definitely has been more anchored in human samples, where there are really good reference genomes; that's part of what lends you towards this marker-based approach in the first place. But it's a fair assessment, absolutely. Okay?
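Before moving on, here is a rough sketch of what you might do with a MetaPhlAn2 profile downstream, pulling out genus-level relative abundances; it assumes the default two-column tab-separated output (pipe-separated clade name, then relative abundance), and the file name is hypothetical.

```python
# Sketch (assumed output format): extract genus-level rows from a
# MetaPhlAn2 profile, where clade names look like
# k__Bacteria|p__Firmicutes|...|g__SomeGenus.
def genus_abundances(profile_path):
    genera = {}
    with open(profile_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            clade, abundance = line.rstrip("\n").split("\t")[:2]
            ranks = clade.split("|")
            # Keep rows whose most specific rank is a genus (g__),
            # i.e. not resolved further to species or strain.
            if ranks[-1].startswith("g__"):
                genera[ranks[-1][3:]] = float(abundance)
    return genera

if __name__ == "__main__":
    # Hypothetical file name for illustration only.
    for genus, abund in sorted(genus_abundances("sample_profile.tsv").items(),
                               key=lambda x: -x[1])[:10]:
        print(f"{genus}\t{abund:.2f}")
```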
Okay, so moving on beyond taxonomy, we want to figure out what these organisms are doing and which functions differ across communities. So the first question to ask is what I actually mean by "function", because people talk about function and it's not very precise. They could be talking about things as general as photosynthesis, nitrogen fixation, glycolysis, or carbohydrate metabolism, which is very loose, or they could be talking about specific gene families. They could be talking about a particular gene, like nifH for nitrogen fixation, or about a particular well-known enzyme, like butyrate kinase, which is involved in butyrate synthesis but is still a specific gene, right? So what you'll find is that when you're doing this annotation, you're usually annotating at some gene level, because reads map to genes, and then there's usually a step after that where you take those gene predictions and map them into modules or pathways, depending on the database. And boy, are there databases out there; this is only a sampling, and sometimes I add to this list and sometimes I take things off. People have probably heard of COG; it's been out there for a long time. The classification was originally made in 2003, but people still use it as an annotation system because it's quite convenient to map reads to. SEED is the annotation system used by RAST and MG-RAST; they have their own families. Pfam has been around for quite a while as well, although its families are protein domains, which sometimes approach full gene length. KEGG has been super popular, even though it went behind a subscription about four years ago now, I think. It's popular because if you've ever seen a metabolic network on a poster somewhere, it has typically been KEGG; they have really rich annotation for their gene families, and they map those into modules and pathways, so if you get a hit to a KEGG ortholog you know roughly what it's doing and you have rich annotation around it. MetaCyc has really been up and coming, and I think it's finally starting to maybe replace KEGG a little bit. MetaCyc is similar to KEGG in that you can map genes to modules and pathways, and it tends to be more microbially focused, which is nice. There was talk of it requiring a license at some point; I don't know if that ever came through. I think they're still completely open from the last time I checked, and hopefully they stay open in the future. And then UniRef is this other approach where, instead of a rich annotation system, you get a comprehensive search database. With KEGG you're restricted to genes that are well annotated; UniRef covers millions of gene families, which are basically just clustered at 100%, 90%, or 50% identity, so UniRef100, 90, and 50. It's nicely updated, so if you're looking for a really comprehensive search against everything, your chance of getting reads assigned is the highest.
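Purely as a toy sketch of that second annotation step, rolling per-gene-family counts up into modules or pathways via a mapping, and not any particular tool's method: the gene identifiers, pathway names, counts, and mapping below are all invented, and real tools add weighting and gap filling, which come up again later.

```python
# Toy sketch (invented data): collapse gene-family abundances into
# pathway-level abundances using a gene-to-pathway mapping.
from collections import defaultdict

gene_counts = {"K00001": 12.0, "K00002": 3.5, "K00845": 20.0}  # hypothetical per-sample abundances
gene_to_pathways = {                                            # hypothetical mapping
    "K00001": ["Glycolysis"],
    "K00002": ["Glycolysis", "Pyruvate metabolism"],
    "K00845": ["Glycolysis"],
}

pathway_abund = defaultdict(float)
for gene, count in gene_counts.items():
    for pathway in gene_to_pathways.get(gene, []):
        pathway_abund[pathway] += count  # naive sum; real tools weight and gap-fill

print(dict(pathway_abund))
```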
Okay, so this is my short second vignette, stepping away from the technical details a little bit. People have probably seen these plots before, from the HMP paper, where they compare taxonomic variation to functional variation, and the take-home message is usually that functions are more conserved, or more stable, than taxa. What's interesting is that this is at the phylum level, and these are high-level KEGG pathways, so maybe the comparison is roughly equivalent, because a phylum is a pretty big group, but it's really hard to compare taxa and functions directly. So we looked into this a little deeper, to ask whether this is just a product of the database you're using, in this case KEGG, or whether functions really are always more conserved. Carl actually worked on this; he's here in the front row, so you can ask him all the gory details if you want. What we did is we took 10 HMP gut samples, computed Bray-Curtis dissimilarities between all pairs of samples, and just plotted that. We found, as you'd expect, that as you go down to the species or strain level you start to see more variability between samples; that makes sense. Then we compared that to KEGG, and sure enough we recapitulated the figure I just showed you: KEGG pathways are much more similar to each other across those samples and are even more conserved than phyla. So that's what we found, but then we wanted to ask what happens if we use a more comprehensive database, and when we map to UniRef gene families instead, suddenly we're also seeing a lot of variability. So are functions more stable than taxa? I don't know, and I don't think it really matters, because comparing functional stability versus taxonomic stability is comparing two different things, right? This is actually a more philosophy-oriented paper that's in review right now, but the take-home message, in a practical sense, is that it depends on what you're looking for. If you're looking across similar samples, you might have to go down to the variability of these UniRef protein families to see the variation you're interested in, whereas if you're comparing very different types of microbiome samples, the differences you see in KEGG pathways might be enough. The other take-home here is just a description of KEGG versus UniRef: as I mentioned, there are a lot more UniRef entries than KEGG entries, and a consequence of that is that most UniRef families are unique to a small number of taxa, so they're not shared across a lot of taxa. Yep, in the back? This could be a very basic question, but I'm wondering about the functional categorization: is it based on similarity of a sequence to a particular reference sequence, given that there are a lot of mistakes in the annotation of those sequences? Yeah, annotation errors are a whole other problem on top of all of this. What I'm showing here is really about the comprehensiveness of the different databases, right? You have some databases that are well curated but limited, so KEGG has something like 15 to 20 thousand protein families, and UniRef is in the tens of millions of protein families. The follow-up question is: if it's a protein of unknown function, what does comparing to it actually tell you? Well, even though you might not have an annotation for some of the UniRef100s, and a
lot of them are of unknown function, the idea is that at least you would know that this unknown function keeps popping up over and over again. If you limit yourself to, say, only KEGG, and you're studying healthy versus disease or whatever comparison you choose, you don't even get to see those families, so you can't test whether they're different in the first place. Whereas if you include a more comprehensive database, you may not know what a family is doing, but all of a sudden you know it keeps popping up over and over again in Crohn's disease or whatever, and maybe that protein family is worth following up and annotating somehow. Someone comments that they're pretty sure KEGG uses some form of this; yeah, I think people can actually do it in different ways. Great. Okay, so moving on to annotation systems, and there are lots out there. There are web-based ones, big servers where you just upload your data: EBI has their own metagenomics server, MG-RAST has been around for quite a while, and IMG, at the Joint Genome Institute, has their metagenome service where you upload data and wait an infinite amount of time. No, I don't know; it usually takes a while, like seriously a long time sometimes, and the results are a bit of a black box, so it's probably not my number one choice. MEGAN has been out for a long time; it's a GUI right in front of you, not a server, and it's usually based on you doing some sort of BLAST search yourself and then feeding that into MEGAN, but the nice thing is it gives you some nice visualizations. And then there are lots and lots of local approaches, and they come and go. MetAMOS is a sort of configurable assembly-based pipeline; it seemed a little buggy, and I haven't looked at it for a while. ShotMAP came out from Tom Sharpton's group a couple of years ago, which was nice; it gives counts normalized as reads per kilobase, which I'll talk about in a little more detail in a second. Kraken also came out, which allowed fairly rapid searching; it focused on letting you take a custom database of genomes and then search it fairly fast. Assembly-based annotation will be covered by Adrian, and HUMAnN2 is what I'm going to talk about in the tutorial and go into in a little bit of detail. Okay, so why is this so complicated? There are a whole bunch of reasons. Why can't I just BLAST against NR? Well, you could do that, but chances are there will be a few problems. One is that it's going to be too slow: if you take your metagenome with 10 million sequences and you try to BLAST that against NR, with however many sequences are in NR these days, which is a lot, you're going to wait basically forever, even with a supercomputer. So one approach is to use some sort of speed-up technique. Leaving the assembly-based idea aside, for pairwise sequence searches DIAMOND came out a couple of years ago, with something in the range of a 5,000 to 10,000-fold speed-up over BLAST, so that's one option. The other issue is that your top hit might not be the real source, and one way around that is to weight all the hits that are significant, weighting the top one the most, and take that into account. You also have a bias from gene length: your chance of hitting a longer gene is larger, because if a gene is twice as long you have twice the chance of hitting it, so you have to take gene length into account when you're calculating your relative abundances, and that usually gets resolved by normalizing to reads per kilobase.
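Here is a minimal sketch of that gene-length normalization, with made-up counts and lengths, just to show why a longer gene with the same raw read count ends up with a lower normalized abundance.

```python
# Sketch (invented numbers): reads-per-kilobase normalization, so that
# long genes don't look artificially more abundant than short ones.
def reads_per_kilobase(read_count, gene_length_bp):
    return read_count / (gene_length_bp / 1000.0)

genes = {
    # gene_id: (mapped reads, gene length in bp) -- hypothetical values
    "geneA": (200, 1000),
    "geneB": (200, 4000),
}

for gene_id, (count, length) in genes.items():
    print(gene_id, round(reads_per_kilobase(count, length), 1))
# geneA -> 200.0 RPK, geneB -> 50.0 RPK: same raw count, but the
# longer gene is down-weighted by its length.
```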
So that's how they normalize it. Now, unless your metagenome is quite restricted taxonomically, or your sequencing depth is really high, you're basically not going to sample all of your genomes to completion, and even if you do, there's a chance you're still missing genes. So you're going to miss genes because your sequencing depth is too low, and how do you get around that if you're trying to figure out pathways? One approach is gap filling, where you say, well, I have 9 out of 10 genes in this pathway, chances are we probably just missed the last one, so we'll bump it up and count the pathway as complete. The other issue is, if you have lots of genes mapping to different pathways, how do you find the minimum set of pathways that explains them? MinPath has been around for quite a few years for that, and it's one of those tools that's still widely used; it's quite nice for doing exactly this. Okay, any questions before I go on? No? Okay. So HUMAnN2 actually hasn't been published yet, and I'd been holding out: I used to teach HUMAnN1 and I was waiting for the paper to come out. It still hasn't, but I asked for the manuscript, so I got a chance to read it, and people are already starting to use it and publish their data sets with it, so I thought I'd go over HUMAnN2. It was also nice that they lent me their slides, which saved me a ton of work, so Eric made these slides and I'm just walking through a couple of them from another presentation. Okay, so the whole idea with HUMAnN2 is that you have your reads, and they divide the reads into four examples here: species 1 and species 2, which they know about, ambiguous ones that they don't really know what to call, and completely novel sequences that they have no information about. The first step in their pipeline is to use MetaPhlAn to figure out the taxonomic profile, using the subset of reads that map to those specific marker genes. After that, once they know that in this case they have, say, a blue organism and a yellow organism, they build a pan-genome out of the taxa that are represented, so they make a custom database, and then they search the rest of the reads against it; they're basically doing recruitment of reads to those genomes. And if there are still reads left over, the ambiguous ones and the completely novel ones, their last step is to take those reads and do a translated search, like a BLASTX-style search, against UniRef90, that large comprehensive database I just talked about. So there are two major advantages. The first is that they make a link between taxa and functions, which is quite nice in the output, and the other big advantage is that it's faster, because they screen a lot of the reads against this smaller reference database that they build on the fly. The one downside right now is that the novel sequences are just left over and nothing is done with them; these are sequences they don't have a taxonomic assignment for and that didn't hit the large comprehensive database either. They say these would be perfect candidates for some sort of assembly or some other approach afterwards, so you could do some
de novo clustering of them and try to do something else, but they don't touch those reads in this framework. Does that make sense? Okay, so the output looks like this. The idea is that you get back some sort of label; by default you get back UniRef90 clusters, you get a gene name for that cluster, and then you get this total gene abundance, which is a normalized count based on reads per kilobase, taking into account some of the other factors below that. You can get the output stratified or unstratified. Unstratified basically means you just get this first line of information; if you ask for the stratified output, you also get the breakdown where, for particular species, you see how their counts contribute to that function, and you also get the unclassified proportion within it, so these all sum up to that total of 600 in this example; that makes sense. They also map these UniRefs to MetaCyc, so you get pathway abundances and pathway coverage. Coverage just means you get a 1 or a 0 for whether that pathway is covered in that microbiome; I don't usually find that as relevant, but you also get an abundance for each pathway, and again you get either the unstratified version or you can ask for the stratified version, where you get the breakdown of that pathway across the taxa in that sample. One of the nice things from the paper, which I've been following a bit, and you can also get this from PICRUSt, which we'll talk about a little, is this idea of functional diversity, or "contributional diversity" as they call it in the paper. I don't really like the name, and maybe we'll come up with something different, but the idea is that you can look at particular functional groups and ask whether the per-sample functional diversity is low or high, and also whether there's much diversity for that functional group between samples, which is what these axes indicate. So this is an example of low within-sample diversity for this function, because it's dominated mostly by one taxon, and it's also fairly consistent across the samples you're looking at. This is an example of high between-subject diversity, where within each sample it's dominated mostly by a single species, but that single species changes a lot between samples. You can also get the case where it's fairly consistent across samples but dominated by a lot of different species within each sample. And lastly, you can have a lot of change both in the within-sample diversity and across samples as well. I think this is really nice, because it starts to move us in the direction we've wanted to go for a while in the field, where we're looking not just at what the functional changes are, but at which taxa are contributing to those functional changes. HUMAnN2 gives us one way to do this, and you can get this kind of data from PICRUSt as well; I'd like to see other tools do this too, so we can start to work out how to handle this data better and how it links to the biology. So if you're really into the pipeline, this is the pipeline diagram. I don't think I need to walk through every step, but the idea is that you input your raw reads, fastq files that have been filtered and quality checked, you do taxonomic profiling with MetaPhlAn2, and then, using this pan-genome database, which they call ChocoPhlAn, they basically get the
reads mapped to those organisms, which gives them the stratified information. Any of the reads that didn't map go into this alternate path, where they do the DIAMOND search against UniRef, and then they basically combine that back together and look at how those genes map to pathways within MetaCyc. You get three major outputs from that: gene family abundance, pathway abundance, and pathway coverage. Their default is UniRef90 mapped to MetaCyc, but the nice thing is you can also map to things like KEGG still, and I think COG as well, so it gives you a segue into multiple systems. All right, any questions about the algorithm? Question: is there some option to say whether a function is related to a particular sample group, so can you associate a function with, say, disease or antibiotic use, or is it just descriptive? HUMAnN2 is just focused on providing the annotation, and then you would usually use those tables afterwards to do some sort of statistical test. What you're getting at is whether you do a simple t-test across all your protein families, or an ANOVA, or some other statistic to figure out which functions differ. I don't know if we need something new for this whole stratified data type; we'll have to think about that more in the future. Usually what happens is you just treat each functional group as a separate thing that you're testing, and then obviously you account for multiple-test correction. And then there's the question of how to test for the case where a function is actually the same between disease and healthy but the taxonomic makeup of how it's encoded differs; I don't know how to test for that yet. In the back first: what about host reads? Sorry, yeah, that's a good question. They're assuming you did that beforehand, so you would have to pre-screen your reads not just for quality but also for contamination first, and in the tutorial you'll actually do that. Usually you'll use bowtie2 or similar mappers, and if you're working with human samples you obviously screen against the human genome; there's a reference combined with PhiX as well, which is often used with Illumina sequencing. But if you're working with mouse or whatever host you have, you've got to screen against that first. You had a quick question, yeah: what's the computing capacity needed to run this? It's actually fairly modest, which is one of the reasons I talk about it quite a bit. Do I have an idea of the run time? When I processed these samples myself it was pretty fast; that was only 32 samples, and it was in the hours range, so you could probably do one sample on a regular machine in an hour or two, and then scale that up for however many samples you have, and that's just on a regular sort of computer. From the audience: it depends, because I tried running it on 15-gigabase data sets and 48 hours was not enough on the setup I used on our supercomputer here, but I need to try it on another one that has more cores per node; it depends on the size, and it's roughly linear in the number of reads. Yeah. And the first part of the question again: would it be restricted to those pan-genomes? Yes, you're limited by what is known; it has the same limitations as MetaPhlAn, because that's what it uses, so your stratification of the function is limited to what MetaPhlAn knows about, which is limited by the genomes we have in databases. But HUMAnN2 will then put the other reads into an unclassified category, so you still get the functions from the things you don't know the identity of. Does that make sense?
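As a sketch of working with that output, here is one way to split a HUMAnN2-style gene-family table into its unstratified totals and its per-taxon stratified rows; it assumes the tab-separated format described above, where stratified rows carry a "|taxon" suffix on the feature name, and the file name is hypothetical.

```python
# Sketch (assumed output format): separate community totals from
# per-taxon contributions in a HUMAnN2-style gene-family table.
def split_stratified(path):
    unstratified, stratified = {}, {}
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # header line
            feature, abundance = line.rstrip("\n").split("\t")[:2]
            if "|" in feature:
                family, taxon = feature.split("|", 1)
                stratified.setdefault(family, {})[taxon] = float(abundance)
            else:
                unstratified[feature] = float(abundance)
    return unstratified, stratified

if __name__ == "__main__":
    # Hypothetical file name for illustration only.
    totals, per_taxon = split_stratified("sample_genefamilies.tsv")
    print(len(totals), "gene families;", len(per_taxon), "with taxon breakdowns")
```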
Okay, my last slide is just a small caveat. John is going to talk about metatranscriptomics, but the point is that what we're measuring here is DNA, so we're measuring the presence of a gene in a sample; we're not measuring the transcripts, and we're not measuring proteins or metabolites. So my usual caution is just to be a little careful with your language, because we often end up assuming that an increase of a particular taxon carrying a gene means it is actively doing that thing, right? So if you talk about, say, butyrate kinase, and then say that leads to short-chain fatty acids increasing, we don't actually know that for sure, so just watch the language around it. Yeah, quick question in the back. Related to that: I was talking to a couple of colleagues about the dead-versus-alive problem, and apparently there are approaches where, depending on how the DNA looks, you can try to conclude which bacteria are actually alive and which were just dead. Do you have a comment on that, and is it implemented anywhere? Because it's important; there's a dormant microbiome, there are dormant bacteria. Is there anything you want to contribute? Yeah, so obviously the easiest way to get around that is to do metatranscriptomics, because then you know something is being actively transcribed; straight from the DNA, I don't know of an established way to do it yet. There was a nice paper, I can't think of the name, where they did flow cytometry on the cells and sorted them based on cell characteristics into dead or alive, and then looked at the taxonomic profiles within those fractions, and they showed there really were differences, so it would affect your profiles to some extent. That was a really clear example, but just straight from DNA I haven't seen anything really conclusive; it's pretty tough. There's the suggestion that DNA in dead or lysed cells gets damaged or cross-linked, so you could ask whether you can see that in the sequence, but would that really solve it? Maybe. The closest thing I would point to is iRep, which is this idea that you can measure replication rates from genomes: you take a genome and look at the coverage across it, and you can get an idea of whether it's actively replicating. So it doesn't really tell you if a cell is alive or dead, but it tells you that the population is actively replicating, so that's one other tool. Yep, and I'm all done, by the way, so we can keep taking questions. The question, following on from distinguishing dead and dormant bacteria, was: if the bacteria are dead and their nucleic acid is in the gut environment, how long does it persist?
For quite a while, and I think that's the problem, especially with stool samples: a lot of what we're measuring is DNA from cells that are dead. You can make the argument that at least it's maybe representative of what's happening upstream, and you don't really care what happens to the cells after they're passed anyway, but that's a whole other question. So I don't have a satisfying answer, and I don't know if anyone knows the percentages, but there is a paper, I think on stool samples, where I believe they used flow cytometry with markers on the cells to figure it out; there are two or three different ways to measure it, basically. I can't remember the percentage, but I can look it up and post it somewhere. Okay, I'll be around if you want to ask more questions; I'll let Frédéric talk about his material now. So I'll be talking about assembly-based metagenomics. First I'll go through a bit of the interpretation that I think we can put on metagenomics: it's like looking into someone's library. This is one of my bookshelves in my basement. If you want to know what's on my bookshelf, you can do cultures: you take one book and you read it. So you isolate the bacterium, and either you measure how it grows in different media, or you sequence it and get the genome, so you know a lot of things about this single bacterium, this single book on the bookshelf. The second thing would be to just scan all the barcodes; then you would know the title and author of each of these books. That's basically doing 16S: the taxonomy from the 16S gives you the names of the bacteria, and their lineages, that are found in the library, and you can also get some metadata: this is a science book, this is a self-help book, this is business, this is science fiction. That's the same as using PICRUSt on 16S data to infer a bit of the function based on which taxa are in there: if you know a book is by Stephen King, you may guess that it's a horror book, or something that's probably a horror book. And then there's the third approach, the shotgun metagenome, where you take all the books, rip them up into pages or lines or little strips, and then try to make sense of that. So you pick up strips of paper and say, this comes from a King book, because in my reference database I know this sentence is specific to a King book, and so on, and you're able to get information on the taxonomy and on the contents. You can also do something like language analysis; in French we call it a "champ sémantique", a semantic field, words that belong to the same frame of reference. If we talk about stool, about feces, about poop, there are many synonyms we can use, and you can get more information about the books by looking at that kind of signal. And if someone is very patient, you can take all these little strips of paper and do a jigsaw puzzle, and that's what the assemblers do. So basically what we do is take the whole bacterial population, extract the DNA, and sequence everything; that's essentially what we do in our lab: we try to measure everything, or at least as much as we can. And why do we want to assemble? Because the length of the sequence you're analyzing determines what information you'll be able to get. If we have very short reads, we'll only be able to map SNPs against a reference; at 100 to maybe 500 nucleotides, we'll
have short functional signatures, and maybe we'll be able to assign a specific origin, like HUMAnN or MetaPhlAn do, at around 100 nucleotides. At 1,000 nucleotides we almost have a gene, so we're probably able to start looking at really complete genes, and then you go further: if you go up to the tens of thousands, you get longer operons, several genes that are side by side, so you'll be able to compare the synteny of the different assemblies you're producing against reference genomes. And if you go further and further, you'll be able to look at pathogenicity islands, mobile elements, and even whole-chromosome organization. So the longer the sequences you get, the more information you have, but it really depends on what you want to do and what questions you're asking of your data. For these longer sequences, sometimes we're able to get them through assembly, but sometimes you need new technologies, like PacBio and other sequencers that read very, very long fragments. One important thing to consider for assembly-based metagenomics, and it's true for any genomic sequencing you'll be doing, is that sequencing coverage is one of the key factors. Basically, you can calculate the depth by taking the number of nucleotides sequenced and dividing it by the expected size of the genome, or of the metagenome; estimating the size of a metagenome is something I won't go into here, but it's an interesting metric and we can use it in metagenomics, and I'll talk about it a bit later. The optimal sequencing coverage really depends on the application and on the diversity of the population you're looking at. If you want to recover the genome of a bacterium that's at 0.1% of your population, you'll need to sequence a lot to have enough reads, because you need 100 times more reads to cover something at 0.1% of the population than something at 10% of the population, so there may be other approaches you'd rather use. The quality of your assembly will depend on the coverage you have and on the complexity of the sequence. What do I mean by the complexity of the sequence? Microbes have different levels of repetition in their genomes, and if a repeat is longer than your reads, for example a repeated gene or a repeated mobile element that's longer than your reads, you'll have a hard time. You end up with lots of small contigs; some colleagues here work on parasites like Leishmania, and Leishmania is painful to assemble because there are so many repeated sequences that even if you sequence very, very deep, your genome ends up broken into hundreds to thousands of contigs. So depending on the bacterium, some are easier to assemble and some are harder than others. If you look at this one, it's pretty easy, because there are not many dots here, so not many repeated sequences, and you get pretty long contigs, around 20 contigs. But when you go to the others, there are several repeated sequences, so you get more broken-up contigs, and if on top of that you have a whole bacterial population that shares parts of its genomes, it may break your assembly even more. The other factor is the diversity of your population, so what is contained in your sample: my finger microbiome is not the same as my stool microbiome; they don't contain the same things. And if you look here, this one is stool, where the estimate is that you need about 5 to 10 gigabases of sequence, well, I
would say a bit more, to get good coverage; you really need something in the range of 10 to 20 gigabases to have something interesting to analyze when doing assembly. But here you have tropical forest soil, and there you need hundreds of gigabases of sequence; these are huge microbiomes. One of the big differences is that in the gut microbiome you get some bacteria that are pretty abundant and then it starts to drop off, while in soil there is a lot of low abundance. If you want to assemble things, you need enough copies of each bacterium in the sample. In the gut there are a lot of taxa that you will be able to assemble, because the distribution is such that several of them sit at 1 to 10 percent of the population, while in soil, for example, you have so many that are under 0.1 percent that you won't really be able to assemble them. So the diversity of the sample you're looking at is very critical when designing a sequencing experiment for assembly. Here is a little slide on some validation work we did on how much sequence you need to get good results. This one is taxonomic profiling using Ray Meta, so it's not exactly the same method as MetaPhlAn, but it's the same kind of outcome. As you see, in each of these there's a little bump at the beginning, below 8 million reads, because each little tick, which you can't see because it's too small and you're too far away, is 8 million reads. Once you get to 10 or 16 million reads you're about flat and you get the right profile; before that you get a bit more error, and your quantification is a bit skewed towards the organisms that are more abundant. This is true for Ray Meta; I don't know for the others. So, is the difference important? That's a good point; probably not. But if you're down here, if you've sequenced, say, 500,000 reads, maybe Ray Meta is not a good tool for your analysis; maybe you'd rather go with another one, or get more sequence. You can trust the taxa that are really abundant, but for those that are less abundant there will be a lot of uncertainty. We did the same thing, with basically the same data, for the size of the assembly: as we increase the number of reads across different samples, we see that even at 120 or even 140 million reads we're still not on the flat part of the assembly curve, so we haven't assembled everything yet. And that's normal, because there are a lot of taxa that are under 0.1%, very low abundance in the gut, that we don't really see in these stool samples, so as we go deeper we keep getting more material. But still, an assembly of around 200 million nucleotides gives us plenty of material to work with; we don't see everything, but we have a lot to work on, although it will mostly be the high-abundance organisms. Now, I won't elaborate on this because I didn't include the slide, but one thing we use to go deeper and get the sequence of low-abundance bacteria is broad culture, a very broad culture in liquid medium: we inoculate it with stool and then we sequence the culture, and that lets us get sequence from a lot of bacteria that are not abundant enough on their own. For example, in stool microbiomes we're usually not able to sequence E.
coli, because it's too low in abundance, but if we do just a standard culture with maximal enrichment, then we see a lot of those bacteria. So that's a trick you can use if you're interested not in the quantities but in the genes. So, choosing an assembler: there are two big families of assemblers. There's the classical one, overlap-layout-consensus, which does the jigsaw puzzle: you overlap the reads and build a sequence from that. People have not been working on that much recently for metagenomes, although some assemblers still use the method. And then there's the approach based on k-mers; we talked a bit about k-mers this morning and we'll talk a bit more about them as the course goes on. Basically you break the sequence of a genome into words of a certain length. So here's the genome, the smallest genome in the world, and if we break it into k-mers of length 5 we get this collection of k-mers, and that can be used to build a graph: if two k-mers are seen side by side and they share their middle sequence, you put a little edge between them, and you're able to build a graph based on the succession of k-mers. This is the method many assemblers currently use, and some people are working on mixing the k-mer approach with the more alignment-oriented overlap-layout-consensus methods. There are several assemblers available; two of the very popular ones currently are metaSPAdes, which was published very recently, and MEGAHIT, which is all over the literature. In our case we use Ray Meta; Ray Meta was created in the lab by one of my former colleagues. There are some very bad assemblers out there and a few that are pretty good; the ones listed here are pretty good, but they all have their advantages and inconveniences. For example, we have a hard time using MEGAHIT and metaSPAdes because they require a lot of memory; they can run in parallel on supercomputers, but they are not distributed, so they are not able to share memory between compute nodes. If we have a lot of sequence and our supercomputer has nodes with smaller memory, then these won't work with the big metagenomes we like to use. That's one of the reasons, because there are others, why we're still working with Ray Meta: it's parallel, but in addition it's distributed, so it shares memory between the different compute nodes, which allows us to use big metagenomes; we just scale, we just add cores if the metagenome gets big. And there are several others. This paper, I recommend you check it out; it was published recently and is a very good review of the recent metagenome assemblers, and this figure is from that paper. It's basically the same kind of analysis I showed previously: this is the sequencing depth, and this is, for the different assemblers, the recovery of the genome, so the fraction of the genome that you get. For some of the assemblers you get very good recovery of the genomes, but here, for example, the line that sits at zero at low coverage is Ray Meta, and for Ray you need good coverage to be able to assemble; that's one of the drawbacks of this software: you need to be pretty deep on a genome, maybe between 30x and 40x, to get a good assembly of it. The other assemblers don't necessarily have this problem, so each of them has its advantages and drawbacks, and you may want to compare different assemblers when you do your analysis, depending on the type of data set and the available computing infrastructure, so it's pretty important.
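Here is a minimal sketch of the k-mer graph idea described a moment ago: break a sequence into overlapping k-mers and connect consecutive ones; real assemblers build this across millions of reads and add error correction, coverage information, and graph simplification on top.

```python
# Sketch (toy example): build a simple k-mer adjacency graph, where two
# k-mers are linked when one follows the other (they overlap by k-1 bases).
from collections import defaultdict

def de_bruijn_edges(seq, k=5):
    edges = defaultdict(set)
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    for a, b in zip(kmers, kmers[1:]):
        # consecutive k-mers share their middle (k-1)-mer
        edges[a].add(b)
    return edges

graph = de_bruijn_edges("ATGGCGTGCA", k=5)  # the "smallest genome in the world"
for kmer, successors in graph.items():
    print(kmer, "->", ", ".join(sorted(successors)))
```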
So it's pretty important. Here I've got my dinosaurs figure; some of these plots look like dinosaurs. On the x-axis you've got the reads sequenced multiplied by the proportion of the bacterium, and each of these little putative dinosaurs is the set of contigs that potentially originates from a specific taxon. If we zoom in here, we've got Escherichia coli and a couple of other taxa; here we have the quantity of sequencing that we think is probably associated with this bacterium, and then our assembly size. We see that at some point, as we increase the quantity of sequence, we're able to get the full genome, but before that it can be a bit patchy.

I'll talk a bit about Ray Meta, because it's a good piece of software and it's what we'll be using in the practical class a bit later; it's good for getting a feel for the different approaches that can be used to characterize your contigs. Basically what it does is take the reads, build a big de Bruijn graph, and do the assembly, but it also colours the assembly based on the reference genomes and taxonomy you give it. With that we're able to profile the proportions of bacteria found in the sample and also to colour the assembly, that is, to tell which contig comes from which bacterium, which makes the binning pretty easy: if you have a contig that looks like something in your reference database, you can infer which bacterium it comes from. One thing to keep in mind, though, is that it's not copies of genomes we're looking at, it's proportions of k-mers: what we see here is the proportion of sequence associated with each bacterium, not the number of genomes of that bacterium in the sample. Some bacteria have genomes of two million bases, others four million, so it's important to take that into account in your interpretation. And besides a taxonomy and reference genomes, we can also give it any functional collection you want: sequences of phages, antibiotic resistance genes, any function you want; you can count anything with Ray. There are also tools to evaluate metagenome assemblies; MetaQUAST is pretty handy for comparing different assemblies, giving you the number of contigs, the size of the assembly, and the number of mismatches when comparing to reference genomes. It's a useful piece of software when you're testing different assemblers for your project.

And now that we have nice metagenome assemblies to work with, what can we do? I included four or five things in this presentation. First, we can simply annotate the assemblies: basically we want to find the genes. In the lab we use Prodigal, and there are other gene finders available from different labs. Then you can annotate the genes you found, either by aligning them to a reference database like UniProt, or by finding domains using Pfam, or, in some of our projects where we look at the effect of antibiotics on the gut microbiome, by aligning them to a reference database of resistance genes. With that we're able to ask questions, and using a very specific database is more manageable, because if you compare against UniProt and you've got a hundred million nucleotides of assembled metagenome, it can take some time. And then you can compare the genes you found to the other known members of the gene family; for example, you find a beta-lactamase gene, then you can compare it to the other beta-lactamase genes that are available, do classical phylogeny, and look at the distribution of genes between microbiomes.
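As a toy version of that gene-distribution idea, here is a short Python sketch that takes, for each gene (or gene cluster), the set of samples or genomes where it was detected, and flags the "core" genes found in essentially all of them. The gene and genome names below are made up for illustration only.

```python
def presence_matrix(hits, genomes):
    """hits: dict gene -> set of genome/sample names where it was detected.
    Returns a gene x genome presence/absence matrix."""
    return {gene: {g: (g in found) for g in genomes} for gene, found in hits.items()}

def core_genes(hits, genomes, threshold=0.95):
    """Genes detected in at least `threshold` of the genomes/samples."""
    return [gene for gene, found in hits.items()
            if len(found & set(genomes)) / len(genomes) >= threshold]

# Made-up example: a chromosomal beta-lactamase looks core, a mobile one does not.
genomes = [f"Ecoli_{i}" for i in range(1, 6)]
hits = {"blaEC": {"Ecoli_1", "Ecoli_2", "Ecoli_3", "Ecoli_4", "Ecoli_5"},
        "blaCTX-M": {"Ecoli_2"}}
print(core_genes(hits, genomes))   # -> ['blaEC']
```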
That's one of the ways we do it: genes predicted by similarity. Basically we take all the genes, we label them, we compare them with a certain percent-identity cutoff, and then we represent their distribution across genomes or metagenomes. For example, we see here that this batch of genes is present in almost all the samples in our study, while there's a group here that is almost absent from this group of samples: it's absent here but present in all the others. I must confess that this is not a metagenome; this is a Clostridium difficile dataset. Here you've got the NAP1 strain, which is very common and a big problem in hospitals, and it is very conserved across the different hospitals; it's like an epidemic we had a few years ago, and these are all the other strains we find. But you can do the same thing on microbiomes.

And you can go even further with this kind of approach. Here I have these little pink bacteria, and the question I want to ask is: are these bacteria representative compared to all the other pink bacteria in the world? What I have here is Escherichia coli. We found the resistance genes, and here you see there are several samples for which we had enough data to assemble E. coli, and those came from culture, and here we have the others, which are probably straight microbiomes. We see that several resistance genes are present in all the microbiomes. Then I can run the same analysis on all the Escherichia coli in NCBI, three to four thousand genomes; here there are something like four thousand genomes of E. coli, and you see there are several genes, matching those at the bottom from my microbiomes, that are shared by every single strain. These genes are in fact in the core genome of E. coli. This approach lets us say, OK, this resistance gene is probably not linked to clinical problems; it's normal to see it in E. coli. And then there are others, like this one here, found only in a few samples, so they're either associated with a particular strain of E. coli or with mobile elements. And here you have the ampC gene, a gene shared by all E. coli but with different subtypes: I don't want to touch the screen, but you have either this allele or the other, a given sample is one or the other, and the same goes for the strains, so this is probably a marker of the E. coli clade.

As we said previously, we can do genome binning, so the goal is... yeah? When you do your assembly, E. coli is a very good example, do you get many strains of E. coli or just one? For example here, the cells that are darker mean there is more than one copy of the gene, so there are probably two strains of E. coli in that microbiome. So it's quite possible to have that, but you couldn't infer, let's say, that this E. coli strain has all the pathogenic genes, so you cannot really say it's a pathogenic E. coli or something? Well, you could look for the pathogenesis genes and check whether they're in the same E. coli; you can have one E. coli with just one virulence gene and it's not considered pathogenic. But then you can look at the coverage of the contigs: if you have E. coli contigs at around 10x and others at 100x, you can infer that some genes are from the strain at 10x and others from the one at 100x, just by looking at the coverage.
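Here is a rough Python sketch of the coverage argument in that answer: if the contigs assigned to one species fall into clearly separated coverage groups, we can tentatively attribute them to different strains. This simply splits on large gaps in sorted coverage; real binning tools model this far more carefully, and the contig names and numbers below are invented.

```python
def split_by_coverage(contig_coverage, min_gap_ratio=3.0):
    """contig_coverage: dict contig id -> mean coverage (at least one contig).
    Returns groups of contigs separated by large jumps in coverage."""
    items = sorted(contig_coverage.items(), key=lambda kv: kv[1])
    groups, current = [], [items[0]]
    for prev, cur in zip(items, items[1:]):
        if cur[1] / max(prev[1], 1e-9) >= min_gap_ratio:   # big jump -> new group
            groups.append(current)
            current = []
        current.append(cur)
    groups.append(current)
    return groups

coverages = {"contig_1": 9.5, "contig_2": 11.0, "contig_3": 98.0, "contig_4": 105.0}
for group in split_by_coverage(coverages):
    print([name for name, _ in group])
# -> ['contig_1', 'contig_2'] then ['contig_3', 'contig_4']
```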
If you add more information, we're able to triangulate. Do you have another question? Yeah, this is not the entire genome, just the genes, the resistance genes? Yes, this was for the resistance genes; it was a kind of proof of concept, because annotating all the genes can make a pretty big table.

OK, so now we'll talk a bit about binning; it's related to your question about E. coli. Basically, in your microbiome sample you have different bacteria, then you do the extraction and the sequencing, and you get all your sequences mixed up, like all the little shredded strips of paper from your books. So you do your binning: you want to say, OK, these sequences come from this genome, those sequences come from that genome. There are several ways to do that. The first one is by reference, using taxonomy: doing an alignment against a reference database with BLAST, DIAMOND, or whatever. What we like to do is colour them using k-mers; I've already explained that, and all the results I showed previously on binning and on the different coverages of the bacteria were done with this method. As long as you have genomes in your reference that are close enough to what you're sequencing, you'll be able to bin them; of course there may be some parts of genomes you won't be able to bin, because either you don't have them in the reference or the sequence is shared too much between bacteria of different taxa, which makes it harder. Another method is differential abundance. You could say, OK, on the x-axis I'll put the GC content of my contigs and on the y-axis the coverage; or, as the initial paper on this did (Nature Biotechnology, 2013), use two types of sequencing that skew the coverage of the different genomes, which lets you plot the contigs and separate them by their position in that plot. Finally there is one that can be used in the lab: binning using mate pairs or long reads. If you're able to get longer reads to stitch things together, like in this very nice figure, you may be able to separate the different genomes in your sample more easily. And then there's re-binning and re-assembly: using different algorithms you bin your reads, then you take the reads from bin one and re-assemble based on that bin. I'd rather use the other methods, but there are interesting papers using any of these.

We talked a bit about this earlier in a question: estimating the replication rate of bacteria. A method for that was suggested first, I think, by Eran Segal's group in Israel, and then the iRep method was suggested afterwards. Basically, bacteria replicate from the origin of replication, so in theory, if a bacterium is growing faster, you'll have more coverage depth near the origin than near the terminus. If you calculate that differential, you're able to estimate the growth rate of the bacterium: you get a curve that looks like this, and if it's growing fast the slope will be steep, while if it's growing slowly it will be about flat. As Jacques said, we haven't implemented that, but in theory it looks good; in practice, that's another question.
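A minimal sketch of the peak-to-trough idea, assuming you already have coverage binned along a single complete genome: smooth the per-bin coverage and take the ratio between the highest value (near the origin of replication) and the lowest (near the terminus). This ignores all the corrections the published methods (PTR, iRep) apply, and the toy coverage values are made up.

```python
def smooth(values, window=5):
    """Simple moving average over coverage bins."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def peak_to_trough(coverage_per_bin):
    """Ratio of highest to lowest smoothed coverage along the genome;
    values well above 1 suggest the population was actively replicating."""
    smoothed = smooth(coverage_per_bin)
    peak, trough = max(smoothed), min(smoothed)
    return peak / trough if trough > 0 else float("nan")

# Toy profile: higher coverage near one end (origin), lower near the terminus.
toy_coverage = [95, 90, 82, 70, 61, 55, 50, 48, 52, 60, 72, 85]
print(round(peak_to_trough(toy_coverage), 2))
```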
Then another thing we can do is try to reconstruct metabolic pathways. This looks a bit like what Morgan explained, but it's something under very active development right now, so that people can do flux balance analysis, or really systems biology with differential equations and things that are way over my head. It's being discussed more and more at meetings: trying to understand what's happening in the microbiome by mapping and studying its metabolic pathways. Here is a website; it's a Canadian initiative, as you can see, Genome Canada, Compute Canada. They reconstructed the metabolic pathways found in fifty or so metagenomes from the Human Microbiome Project. It's pretty interesting; I was only made aware of it this week, so I haven't studied this kind of approach much, but there is a representation of all the pathways they have in their models, and you can zoom in to see all the different pathways. It may be an approach with a lot of promise for the future. You can go a bit further: this is work where they try to build a microbial metabolic influence network, meaning they link the different bacteria from the microbiome, through the network, to the different metabolites you measure. This is a framework you could use afterwards to integrate metabolomics and metagenomics. You can look at the reference here and go read the paper; it's pretty interesting, and this kind of approach is getting more and more studied and more mature. People have been working on it for a few years, and now it's almost ready for primetime.

Finally, the last point I want to touch on is secondary metabolism. This was a big deal a few years ago, when they found a new antibiotic in the vaginal microbiome. These are approaches where you try to find clusters of genes, operons or things that look like operons, that synthesize natural molecules like antibiotics. Here it shows that in different places of the body there are a lot of different biosynthetic gene clusters that can synthesize different molecules. This is lactocillin, an antibiotic discovered using this method, and this lab and others are starting to publish more and more antibiotics that come from biosynthetic gene clusters. Basically, it looks like a cluster of genes with different functions, and from the sequence of the genes, chemists are able to infer the synthesis of the molecule. As you can see, they come in every shape and size, and they have different distributions between strains, bacteria, and environments. This is a piece of software from a Canadian lab, I think from McMaster, called PRISM: they annotate the different genes with known domains and genes, and with some machine learning they are able to infer which molecules will be produced by these gene clusters. So it's a way to do discovery based on references, but including some machine learning in the process, to discover new molecules or to infer which molecules will be produced by a bacterium you have in your population.
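As a caricature of what tools like antiSMASH or PRISM look for, here is a Python sketch that slides a window over the ordered gene annotations of a contig and flags regions dense in biosynthesis-associated domains. The domain list and thresholds are made up for illustration; real tools use curated HMM models and, as mentioned, machine learning.

```python
# Illustrative (made-up) set of domains often found in biosynthetic gene clusters.
BIOSYNTH_DOMAINS = {"PKS_KS", "PKS_AT", "Condensation", "AMP-binding", "Thioesterase"}

def candidate_clusters(gene_domains, window=10, min_hits=4):
    """gene_domains: list of domain names, one per gene, in genomic order.
    Returns (start, end) gene-index ranges that look cluster-like."""
    hits = [1 if d in BIOSYNTH_DOMAINS else 0 for d in gene_domains]
    regions = []
    for start in range(0, max(1, len(hits) - window + 1)):
        if sum(hits[start:start + window]) >= min_hits:
            regions.append((start, start + window))
    return merge_overlapping(regions)

def merge_overlapping(regions):
    """Merge window hits that overlap into single candidate regions."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```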
So these are the different aspects I wanted to touch on during this course, but one thing you must not forget is that one E. coli is not like the next one, and one bacterium is not like another: they're all a bit different, they may have slightly different gene sets. So as you do your analysis, you must take into account that there is diversity. Two people can have the same taxonomic profile at the family level, but when you look at the genes it can be totally different, and it's important to keep that in mind in your future analyses.

So with your coloured assemblies, I guess some contigs could disagree a lot, and it could be LGT or it could be misassemblies; I guess you don't really worry about that for now, given what you just described? Yeah, well, as we get more data we can become more tolerant of that risk. The way I see it, I have a lot of data, a lot of sequences, and if I make a one percent error, all the information I get from the other 99 percent may be worth that one percent error. We haven't characterised the error rate in that way, but Ray is pretty conservative in its assembly, and that's one of its advantages: it doesn't make as many errors as other assemblers, SPAdes for example. So we tolerate the errors and try to be aware of them as we work, but once we're dealing with big numbers, the errors are part of the numbers. Is there a cut-off, how many reads or reads per kilobase, before you decide that a species is present? Yeah, well, if you assemble contigs and you're able to get the four million bases of E. coli, then you can expect it's present; but if I only have a few contigs from a species, then I can't really say that it's present. And sometimes you can run into problems: for example, we had mouse shotgun microbiomes, and in the profiling we did not include the mouse genome. In the mouse genome there are some sequences with k-mers that look a lot like a particular bacterium, I don't remember which, so when we did the analysis without the mouse reference, some taxa appeared that seemed very important but were not. We reran the analysis including the mouse genome and then we got a proper result. That's one of the nice things: it allows you to give it the background and include it in the analysis.
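On that host-contamination point, here is a hedged Python sketch of the simplest possible screen: build a k-mer set from the host genome and drop reads that share too many k-mers with it before profiling or assembly. In practice you would map against the host genome with a real aligner, or, as described above, include the host in the profiler's reference; the k-mer size and threshold here are arbitrary choices for illustration.

```python
def kmer_set(seq, k=31):
    """All k-mers of length k in the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def looks_like_host(read, host_kmers, k=31, max_host_fraction=0.2):
    """Flag a read if more than `max_host_fraction` of its k-mers match the host."""
    read_kmers = kmer_set(read, k)
    if not read_kmers:
        return False
    return len(read_kmers & host_kmers) / len(read_kmers) > max_host_fraction

def filter_host_reads(reads, host_genome, k=31):
    """Keep only reads that do not look host-derived (toy, in-memory version)."""
    host_kmers = kmer_set(host_genome, k)
    return [r for r in reads if not looks_like_host(r, host_kmers, k)]
```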