This morning, we talked all about metagenomic taxonomic annotation. Yesterday, you talked all about taxonomy. And finally, you're going to not talk about taxonomy. So the learning objectives: we're going to discuss, broadly, what we mean by functional composition versus taxonomic composition. I'm going to try to give you a really high-level overview of the fact that there are different functional databases out there and what my take is on some of them. We're going to talk about the pros and cons of doing assembly and gene calling when we're doing functional annotation. I'm going to walk you through how HUMAnN actually works. And then you'll use HUMAnN and STAMP in the tutorial this afternoon to functionally annotate the same samples that you taxonomically annotated this morning. OK, so functional composition. What are they doing? So obviously, this morning, we talked about taxonomic composition, where you get the list of the microbes and who is there, whereas functional composition really answers what are they doing. So metagenomics provides the opportunity to catalog the entire set of genes from the entire community. You have to remember that these reads come from all the microbes, and some of these reads will actually contain parts of genes or entire genes, but usually just parts of genes. And the idea is to try to annotate those fragments with some sort of functional annotation label. So what do we mean by function? I think this kind of came up maybe yesterday or this morning; it's all blending together. By function, we could be talking very generally, in very loose terms, like photosynthesis or nitrogen metabolism or glycolysis. That's obviously function at a very high level. Or we could be talking about really specific groups of orthologs. Everybody remember what an ortholog is? Has anybody got a definition for me? Come on. Shout it out. Same function, different organism? Yep, pretty much. Anybody else? That's good. Yeah, the phylogeneticist in me sort of says...
So usually they'd have to be phylogenetically related. You could have convergent evolution that would lead to the same function in different organisms, but that wouldn't typically be an ortholog. But yes. And technically, orthologs don't necessarily always have the same function. But anyway: groups of genes across gene families. And so this could be, say, something like nifH, which is involved in nitrogen fixation. It could be an EC number, which is a different type of classification, from the enzyme classification system; so EC 1.1.1.1 is alcohol dehydrogenase. Or it could be from a different database like the KEGG Orthology database, where K00929 is butyrate kinase, which is involved in the production of butyrate. So often when we're talking about function, we could be talking specifically about a particular gene or a particular enzyme, or we could be talking about something more general like a pathway or a module. And we're going to relate these things together. So there are lots of functional databases out there; I'm just going to run through a few of them. COG stands for Clusters of Orthologous Groups. It's a classification that's been around for a long, long, long time. The actual database hasn't really been updated since 2003, which is... could you even sequence back then? I'm joking, but that's a long time. You still see the annotations, though. COG annotations are the ones with those single-letter categories, like R, T, and S; I don't remember exactly what they all are anymore, but they break functions down into roughly a dozen or so broad groups. They're still used quite a bit; people annotate things with COG categories. The SEED is another classification system. The SEED is used by systems like RAST and MG-RAST. MG-RAST is a metagenomics server where you can upload your data, and RAST is a genome annotation system for microbial genomes.
And those systems use the SEED as their type of functional annotation. Pfam is more focused on protein domains, although they do have whole genes as well, and they use HMMs to describe the Pfam families. eggNOG is another one; it's more comprehensive, with about 190,000 different gene families. UniRef is becoming, I think, more and more popular recently. UniRef is an automated clustering where they keep updating the clusters, and the clusters, kind of like OTUs, are built at different identity thresholds. So UniRef100 means that everything within a cluster is 100% identical, UniRef90 means sequences within a cluster are at least 90% identical, and UniRef50 at least 50% identical. So at UniRef50, there are fewer families, because sequences get grouped into bigger clusters, whereas UniRef90 has more gene families but with tighter groups. It's kind of nice because they're constantly updating it with all the new genomes that come out. I'm not going to talk about it much more today, but there are systems starting to use it, and I think we'll see more of that in the future. KEGG is still very popular. Each entry is really well annotated; if you've read papers, you've probably seen those nice metabolism charts. And they link these KEGG orthologs, which I'll talk about in a second, into higher-level things like modules or pathways. Full access actually requires a fee now; they went private, I don't know, three or four years ago probably, if I had to guess. So people are still kind of using the last snapshot of the database from then. You can still go to the KEGG website and get annotations from the system; that's all open. But the back-end files that include all the sequences for new KEGG orthologs aren't available unless you pay for access to the private site. So people basically use the last free snapshot and continue developing tools for it. And slowly, I think, we're starting to migrate a little bit away from KEGG.
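To make the UniRef idea concrete, here is a toy sketch of identity-threshold clustering. This is not UniRef's actual algorithm (UniRef is built on much more sophisticated clustering); the short sequences and the crude per-position identity measure are invented purely to show why a stricter threshold gives more, tighter clusters:

```python
# Toy illustration of identity-threshold clustering, UniRef-style:
# at a 90% threshold (like UniRef90) sequences fall into more, tighter
# clusters; at 50% (like UniRef50) they merge into fewer, broader ones.
# "Identity" here is a crude per-position match fraction, for illustration only.

def identity(a, b):
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold):
    clusters = []  # each cluster is represented by its first ("seed") sequence
    for s in seqs:
        for cluster in clusters:
            if identity(s, cluster[0]) >= threshold:
                cluster.append(s)  # close enough to an existing seed: join it
                break
        else:
            clusters.append([s])   # no seed was close enough: start a new cluster
    return clusters

seqs = ["MKVLAAGT", "MKVLAAGS", "MKQIAAGT", "MRPWNNQT"]  # made-up peptide fragments
print(len(greedy_cluster(seqs, 0.9)))  # 4 clusters: strict threshold keeps them apart
print(len(greedy_cluster(seqs, 0.5)))  # 2 clusters: loose threshold merges the similar ones
```

The same greedy seed-and-join idea underlies real sequence clustering tools, just with proper alignment-based identity.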
And more people are starting to use things like MetaCyc, which is the next one. It's more microbe-focused than KEGG, which is kind of nice, whereas KEGG includes a lot of human functions. So MetaCyc is an annotation and classification system that's more focused on microbes. OK, is there any favorite functional database that I missed? What's that? Gene Ontology, of course. Yeah, Gene Ontology, absolutely. Gene Ontology is another really common system, and the nice thing about it is that it's an ontology, so you have this hierarchical structure where genes get mapped into more and more general categories. Yeah, absolutely, I need to update this slide with Gene Ontology. GO is becoming maybe a little more used with microbes, but for a while it was mostly useful for human and other larger eukaryotes. For microbes, kind of like with KEGG, some genes aren't really well represented, but it's still pretty widely used. Yeah, that's a good one. OK, so... Which ones would you recommend? Which ones do I recommend? So the last three are probably the most useful. I'm going to talk about KEGG today. KEGG's great because whatever you find is really well described. So if you find something significantly different, it's well described, you can map it to modules or pathways, and you can make these nice, pretty pictures; it's awesome. But the problem with KEGG, and I don't know if I discuss this later, is that it's not as comprehensive. The KEGG database has 10,000-ish KEGG orthologs, you know, gene families, whereas eggNOG has 190,000 different families, and UniRef90, I think, has 250,000 or more gene families.
So what that means is that if you take your genes and run BLAST or another similarity search against the KEGG orthologs, you're only going to annotate a certain percentage of those reads. Now, that happens for various reasons. One, maybe there's no gene there; two, maybe there's just not a significant hit; but it could also simply be that the KEGG ortholog database is not very comprehensive. There are lots of genes out there, as we talked about this morning, whose function we don't know. We don't know anything about them, so they're left out in the dust by the KEGG database. So I think it makes more sense to move in a direction where we at least try to catalog gene families as comprehensively as possible, and if they're unknown, well, at least we identify them, and later on we can worry about annotating them. Personally, I think UniRef is a really good solution for this, because it's automatically updated. It's not someone's database that they made once and that has gone obsolete; it seems to have staying power. And I'm starting to see new tools use it. I'm going to talk about HUMAnN 1 today; I know HUMAnN 2 is released, but it's not published yet, so I decided not to describe that algorithm, and we're going to dig into HUMAnN 1 instead. But I know that with HUMAnN 2 they're using UniRef90 as their database, which is much more comprehensive. And I think that's much more satisfying, especially if you're working in an environmental microbiome where you might have a lot of unknowns. And then MetaCyc seems really good too; I'm just not as familiar with the system. MetaCyc has nicer, more in-depth annotations for microbes. So the direction I'd like to see is UniRef and MetaCyc, but for now, I think KEGG is still a pretty good, popular choice. [Audience:] Can PICRUSt adapt to those other databases? Oh, yeah. So PICRUSt basically takes all the genomes that have been sequenced...
...and uses the annotations for those genomes. So if we can get a massive table of genomes against different types of functional categories, and it's reliable, then PICRUSt works with it. We focused on KEGG for our validation, but we do have COG predictions, and we do have Rfam predictions. And we started to play with MetaCyc, and we actually made it work, but there was no validation. It's one thing to make a prediction, but when you have nothing to validate it against, I don't really want to let it out into the world. So it's definitely possible, and we can actually make the predictions; we just have to do some validation to make sure it's accurate, like it was for the KEGG annotations. Yeah. OK. So I'm going to describe KEGG, but a lot of this applies to other databases too, so it's good to have an idea of how these things are assembled. With the KEGG database, the most specific unit is the KEGG ortholog, or KO. A KO is meant to be a particular gene carrying out some exact function. Now, I've quoted about three different numbers for how many KOs are in the database; I should just look it up. There are currently about 12,000 KOs in the database. These KOs are then linked into KEGG modules and KEGG pathways. KOs have identifiers like K followed by five digits. And it's slightly confusing, because people refer to them as KEGG orthologs or KOs, but the identifier prefix for KEGG pathways is actually the lowercase "ko", which is super confusing. KEGG orthologs get the capital K; pathways get the lowercase "ko" prefix. So if you go to the KEGG database, say with a prediction from PICRUSt or an annotation that has an ID, you can just look it up and it'll show you the entry. There's often a common name for the gene listed, along with a longer definition of the protein product.
And then it lists the different pathways it's associated with. So remember I told you that KEGG orthologs are often grouped into multiple pathways? This shows that this particular KO is involved in all these different pathways. It's also involved in these three different modules, and it's also linked to this disease; the KEGG database tries to organize that as well. It also has these BRITE hierarchies, which basically break some of these pathways into more general categories. [Audience:] So the KO is the biggest thing at the top of the hierarchy, and then you have the modules? No, the opposite. KEGG orthologs are actually the most specific, the smallest thing at the bottom. Modules are bigger; modules are made up of KEGG orthologs, and pathways are bigger still. Yeah. So a KEGG ortholog is basically the same as a gene name, if you want to associate it that way in your head: it's a specific gene. OK, so then you can group these KEGG orthologs into KEGG modules. These are small, manually defined functional units: small groups of KOs that are thought to function together. There are about 750 KEGG modules in the database, labeled with a capital M for module. What's cool about KEGG modules, and different from KEGG pathways, is that they have a really robust definition of what makes up a module. It's not just a bag or a loose clustering of KEGG orthologs. In order for a module to be called present in a genome, the genome has to have a certain combination of KEGG orthologs. The way they express this is as a graphical representation as well as a one-line text definition. So to cover this particular module, shown with the white boxes, you start at the top, and you have to have, for sure, K01803. Then at the next step, you can either have K11389 to get through this step, or you have to have one of these KOs plus this other one.
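The step-wise logic just described can be sketched as a small boolean check. Real KEGG module definitions use a richer grammar (optional "-" terms, "+" for complexes); here each step is just a list of alternatives, and each alternative is a set of KOs that must all be present. K01803 and K11389 are the KOs mentioned above; the other KO IDs are hypothetical placeholders:

```python
# Simplified sketch of evaluating a KEGG-module-style definition against
# the set of KOs detected in a genome or metagenome.

module_definition = [
    [{"K01803"}],                        # step 1: K01803 is required
    [{"K11389"}, {"K00134", "K00927"}],  # step 2: K11389 OR (K00134 AND K00927)
    [{"K01834"}, {"K15633"}],            # step 3: one of two alternatives
]

def module_present(definition, detected_kos):
    """A module is complete if every step has at least one
    alternative whose KOs are all in the detected set."""
    return all(
        any(alt <= detected_kos for alt in step)
        for step in definition
    )

print(module_present(module_definition, {"K01803", "K00134", "K00927", "K15633"}))  # True
print(module_present(module_definition, {"K01803", "K11389"}))  # False: step 3 unsatisfied
```

The point is that a module isn't a bag of genes: the structure of the definition decides whether the combination you observed actually completes the functional unit.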
At the next step, you just have to have one of these to get through: this one for sure, and then one of these two. Does that make sense? And if you have that, plus any mixture of those that completes the definition, then the module is present. And they represent all of that with this text definition. The nice thing is that you actually have some structure to it, and modules are a bit smaller. KEGG pathways, on the other hand, are basically just groupings of genes within a pathway. There are about 233 different pathways. What's nice is that they have these really cool graphical maps that you can pull out. And what's also really cool is that you can highlight your genes that are, say, significantly different across your metagenomes, overlay that on the graphical figure, and relate it back to the pathway. Pathways are usually a bit more general, and you can even collapse them into still more general categories that aren't really biologically useful, things like "amino acid metabolism" and "carbohydrate metabolism". All right, does that make sense? Rockin'. OK, so for annotation systems out there, here are some of the ones I've listed. For web-based systems, there's the EBI Metagenomics server. This is one of those servers where you just upload your data and they do the annotation. I think they use InterPro behind the scenes, which is quite slow if you try to run it on your own, but I guess they have these massive servers that just crank through the data. MG-RAST I've mentioned a couple of times already; it's been around for a few years. I'm suddenly aware that this audio is going to be online... anyway. MG-RAST tends to be pretty good, but the server has often been down recently. I think it's because a lot of people submit data, and it sometimes goes down under the load.
And sometimes you don't know how long it's going to take to get your data back; it could be a week or two or more. There's also a metagenome annotation system through IMG/M. For that, I believe you have to go through an application process. I think it's usually free, but it's through JGI, and I'm not quite sure of the details of the process; as far as I know it's free, and then you can upload your data. [Audience:] IMG, that's through JGI, the Joint Genome Institute, right? Right, and that's not too bad. Sometimes. Well, I'm just saying it's been bad for you. Right, yeah. Anyone have experience with those? Thoughts? For desktop-based tools, which run more on your own system, there's MEGAN, which I mentioned this morning and which has been around for quite a while. MEGAN basically takes a BLAST output, where you've already done the BLAST offline; you load it in, and then you can relate taxonomy to function. CloVR I should really just strike out. It was a virtual-machine-based system that contained a sort of SOP, but it hasn't been updated for a while now; that's going to get scratched from this slide soon. And then for local pipelines, there's metAMOS, where you can customize different things: which database you want to search against, whether you want to do assembly or not, lots of different options. It's pretty cool. When I was testing it, some of the features were a little buggy, and I think they've probably ironed some of those out, but if you're looking for something where a lot of the pipeline is automated and you can still change different options, it sounds like a pretty good one. The do-it-yourself option, which is probably still what a lot of people do, is to come up with your own pipeline. That is super popular; I go to posters and people will always have a new metagenomics pipeline, it's pretty amazing. And then HUMAnN I'm going to talk about later on.
OK, so why the heck is this functional stuff so complicated? Why don't I just BLAST my reads against the NR database? There are a few problems with that. One: BLAST is really slow. You have millions of reads, probably, and the NR database has millions of genes, so it's just going to be too slow. Next: the top hit may not actually be the correct functional annotation. We're thinking about functional annotation here now, and there are two different reasons the top hit could be wrong. One, the annotation in the database could itself be wrong; or two, you'll often get ties, where several hits are all, say, 100% identical, and the one that happens to come out on top doesn't have the right annotation even though the others are equally good matches. So you get these bad annotations, and it's hard to choose which one is right. Then there can be bias from gene length or from the database. This is kind of like our discussion this morning about genome size: if you have a bigger gene, the chance of a read hitting that gene is higher than for a smaller gene. So if you're counting different functions in your table, you want to down-weight the longer genes. If you have a gene that's four times as long as another gene, the chance of hitting it is four times higher, and it's going to be over-represented in your counts. So some people normalize for gene length. There can also be bias from the database: the database is obviously based on whatever has been sequenced in the past, and if all your top hits are basically the same thing over and over again, you might not see that it could be something else further down your list of hits. Another problem is that sequencing depth is usually too shallow to cover all the DNA in the sample.
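The gene-length correction just described is simple arithmetic; here is a tiny illustration (the gene names, lengths, and counts are invented). A gene four times as long collects roughly four times the reads, so dividing each count by gene length in kilobases puts them back on equal footing:

```python
# Gene-length normalization: divide raw read counts by gene length (in kb)
# so longer genes aren't over-counted just for being bigger targets.

gene_lengths_bp = {"geneA": 4000, "geneB": 1000}  # geneA is 4x longer
raw_counts = {"geneA": 40, "geneB": 10}           # ...and got 4x the reads

normalized = {
    gene: count / (gene_lengths_bp[gene] / 1000.0)  # reads per kilobase
    for gene, count in raw_counts.items()
}
print(normalized)  # {'geneA': 10.0, 'geneB': 10.0} -- equal once length is corrected
```

This is the same idea behind RPK-style units in expression analysis, applied here to metagenomic gene counts.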
So it's not like genomics, where you would sequence a genome to, say, 30x coverage and have reads representing every single nucleotide in that genome, and therefore all the genes. Since we're doing metagenomics, and we sometimes have thousands of different OTUs, the chance of actually getting a read from every single gene in your metagenome is low, unless you're sequencing to a crazy depth or the metagenome is not very diverse. And the other thing is: if you then want to collapse these things into, say, KEGG modules or KEGG pathways, how do you get there from the individual gene abundances? A good example: if you have one gene from a whole pathway, does that mean I should count that pathway as one? Do I divide by the number of genes in that pathway? Or do I wait until I have most of the pathway before counting it at all? OK, any questions about that? [Audience:] Is this metagenomics after assembly? I'm going to talk about assembly after; is that all right? [Audience:] I was just wondering, for that last point, how do you handle several genes sitting on different parts of the same contig when you quantify? Yeah, that would be a problem. I think if you do assembly and you have multiple genes on a single contig, people will count each one of those hits. [Audience:] And do tools like MG-RAST take that into account, for a contig that might be 2,000 bases long with multiple genes in it? I think so, yes. And I think they collapse hits so that they know which hits are all to the same region, so those don't get counted multiple times; it's hits to different regions of the contig that count separately. OK, so this is the HUMAnN pipeline, and I'm going to break it down step by step. The first step, which basically all functional annotation pipelines share, is a similarity search against a database.
In this case, they do a translated BLAST against the KEGG ortholog database. Does everyone know what a translated BLAST is? No? So there are different flavors of BLAST. blastn is a nucleotide-versus-nucleotide search. blastp is a protein query against a protein database. blastx, which is a translated BLAST, takes a nucleotide query and searches it against a protein database, searching all six reading-frame translations of your nucleotide sequence: three forward and three reverse. And the last one is tblastx, which is a translated query against a translated database. Does that make sense? So people usually do functional annotation in protein space, searching a protein database, which means that if you're interested in non-protein-coding genes, you would not want to do this. It's something to keep in mind: if you're interested in RNA genes or other interesting things you can annotate DNA with, we're mostly talking about proteins here. The reason we do a protein search is partly that we can make it quicker, as I'll show on the next slide, but also that proteins are more conserved over time, so a nucleotide (blastn-style) search would be noisier. All right, does that make sense? Yay. [Audience:] Sorry, is that from Curtis Huttenhower's group? Yeah, sorry, I thought I had it listed: HUMAnN is from Curtis Huttenhower's group. [Audience:] It sounds human-only. I should have mentioned that as well, but people use it for more than just human data, because it's built on the KEGG database, so there's nothing really specific to humans in it. They just happened to call it HUMAnN; it's some long acronym, and I don't remember what it stands for, but you can look it up. It was, though, the main annotation tool used for the Human Microbiome Project.
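The six-frame translation behind a blastx-style search can be sketched in a few lines. This is a toy version: the codon table below contains only the handful of codons the example needs, whereas a real implementation would use the full standard genetic code:

```python
# Minimal six-frame translation, the idea behind a translated (blastx-style)
# search: translate the read in three forward frames and three
# reverse-complement frames, then match each protein against the database.

CODON_TABLE = {  # deliberately tiny toy table
    "ATG": "M", "TTT": "F", "GGC": "G", "TAA": "*",
    "TTA": "L", "GCC": "A", "AAA": "K", "CAT": "H",
}

def revcomp(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def translate(seq):
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for codons missing from the toy table
        for i in range(0, len(seq) - 2, 3)
    )

def six_frame(seq):
    rc = revcomp(seq)
    return [translate(seq[f:]) for f in range(3)] + [translate(rc[f:]) for f in range(3)]

for frame in six_frame("ATGTTTGGCTAA"):  # frame 1 forward gives "MFG*"
    print(frame)
```

Each of the six protein strings would then be searched against the protein database, which is part of why translated searches cost more than plain protein-versus-protein ones.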
OK, so I said earlier that BLAST is slow, and people have been developing faster similarity search methods. After BLAST came out, there were faster tools. Has anyone heard of BLAT, maybe? BLAT is a tool that makes nucleotide searching a lot faster. There was another tool called RAPSearch, and then RAPSearch2, which does protein searches even faster. And then more recently DIAMOND came out; I don't have my references in here, sorry, but it was published last year. They claim a massive speed-up over blastx. This is a figure from the paper, and they show that compared to BLAST you get almost a 20,000-fold speed-up over blastx, which is pretty amazing. And then in these two figures they also show that, hopefully, they still get the right answer. This shows different types of data, and the bars show the correct queries mapped using each search method; they compare against RAPSearch2 as another contender. They do fairly well across the queries mapped, and they also recover most of the matches that BLAST finds. From our own tests, DIAMOND does really well if your similarity is fairly high; anything above about 80% identity seems to do very well. If you're looking for really divergent sequences, say in the 30% to 50% identity range, DIAMOND doesn't always give you the best hit, because of the shortcuts it takes to speed things up. So if you're looking for very diverged proteins, you're probably better off sticking with BLAST and just waiting. But with a 20,000-fold increase, this basically means we can sometimes get stuff done without a massive cluster. In the example today, you'll run a search in about a few minutes, and we calculated one time that BLAST would take about a couple of days to do the same search. So when you have this much data, sometimes you just have to use whatever can actually process it.
OK, so the first step is that search, which currently happens outside of HUMAnN; you'll run that search yourself in the tutorial. The second step is to process those search results and work toward what you want at the end: pathway and module information. The first thing HUMAnN does is normalize and weight the search results. You can imagine you have all these hits to different entries in the database, and they normalize those hits using a few different approaches. One, they obviously take into account how many reads map to the gene sequences in each KO, and they weight those hits by the inverse p-value of each mapping; the higher the p-value, the less weight it contributes. The other thing they do is take into account the average gene length of each KEGG ortholog, dividing by that average length to produce a normalized score. So right at this stage you're already going from real read counts to some decimal number that's been adjusted based on the similarity to the KEGG ortholog as well as gene length. Does that make sense? Right. OK, so the next step deals with the problem I mentioned: if you only have one or two KEGG orthologs from a pathway, how do you start to count that pathway? As I mentioned already, a KEGG ortholog can map to one or more KEGG pathways, but just because we find one KEGG ortholog, or a few, from a pathway doesn't actually mean that pathway exists in the community. And that's important, because that pathway might represent something that's not actually occurring. For this, they basically wrap a previously developed tool called MinPath, and MinPath attempts to remove spurious pathway assignments.
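The normalize-and-weight step above can be sketched roughly as follows. This is not HUMAnN's exact math: the (1 − p) weighting is one simple way to down-weight weak hits, and the KO IDs, p-values, and average lengths are invented for illustration:

```python
# Rough sketch of HUMAnN-style hit weighting: each read-to-KO mapping
# contributes a weight that shrinks as the hit's p-value grows, and the
# per-KO total is then divided by that KO's average gene length.

hits = [  # (KO, p-value) for individual read mappings; values invented
    ("K00929", 1e-30),
    ("K00929", 1e-5),
    ("K01803", 1e-20),
]
avg_ko_length = {"K00929": 1200.0, "K01803": 800.0}  # invented average lengths (bp)

ko_abundance = {}
for ko, p in hits:
    weight = 1.0 - p  # strong hits (tiny p) contribute close to 1; weak hits less
    ko_abundance[ko] = ko_abundance.get(ko, 0.0) + weight

for ko in ko_abundance:
    ko_abundance[ko] /= avg_ko_length[ko]  # length normalization, as described above

print(ko_abundance)
```

Note how the output is no longer an integer read count: it's a weighted, length-normalized score, which is exactly the transition from "real read counts" to "some decimal number" described above.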
The idea is that you have all these KOs mapping to different pathways, and you remove the pathways that aren't well covered. The next step uses organism information from the KEGG genomes: pathways that have not been observed in any organism, and that are mostly made up of KOs that also map to a different pathway, are removed. So if most of the KEGG orthologs look like they belong to some other pathway, and the pathway isn't known from any organism, they just remove it. The other thing they take into account is estimated copy number. This relates back to 16S copy number: if a genome has multiple copies of a gene, they try to normalize for that, so a KEGG ortholog is only counted once per genome. And finally, there's one last step where they do some smoothing and gap filling. The idea here is that people doing metagenomics have usually not sequenced deeply enough to sample all the genes, so just because a gene is missing from a pathway doesn't mean it's actually missing from your metagenome; it could just be low sequencing depth. So they boost KEGG orthologs that they think are actually present but just haven't been sequenced. OK, and at the end you basically have two types of output. One is pathway coverage, which is presence/absence: for this metagenome, do we think this pathway is at least completely covered or not? The other is abundance, which takes relative abundance into account and is much more informative. Usually I don't use coverage for anything; I want to look at the relative abundance of both the pathways and the KEGG modules. All right, does that make sense? So that's HUMAnN in a nutshell. Great. All right, so what about assembly?
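The coverage-versus-abundance distinction can be illustrated with a small sketch. HUMAnN's actual formulas are more sophisticated; the pathway membership, KO IDs, and abundances below are invented, and a plain average stands in for its abundance statistic:

```python
# Coverage asks a presence/absence question (what fraction of the pathway's
# KOs were detected at all); abundance summarizes how abundant those KOs are.

pathway_kos = ["K00001", "K00002", "K00003", "K00004"]  # hypothetical pathway members
ko_abundance = {"K00001": 5.0, "K00002": 3.0, "K00003": 0.0, "K00004": 2.0}

abundances = [ko_abundance.get(ko, 0.0) for ko in pathway_kos]

# coverage: fraction of the pathway's KOs with any signal at all
coverage = sum(a > 0 for a in abundances) / len(abundances)

# abundance: a summary of the KO abundances (simple mean here, for illustration)
abundance = sum(abundances) / len(abundances)

print(coverage)   # 0.75 -- three of the four KOs were detected
print(abundance)  # 2.5
```

You can see why abundance is the more informative output: two samples could both have coverage 1.0 while differing ten-fold in how much of the pathway's machinery is actually there.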
I think we got into assembly a little bit yesterday. As a quick recap: assembly is the process where we take these fragmented reads of DNA, find overlaps between the reads, and assemble the reads into contigs. Then, later, you can possibly link contigs into scaffolds, where you don't know for sure what the sequence between the contigs is, but you know their relative positions. So, what about assembly for metagenomics? There are pros and cons to doing assembly first and then functional annotation, versus doing functional annotation directly on the raw reads. OK, a lot of this is my own thoughts, so jump in if you have questions. On the pro side for metagenomics, there's less computational time for the similarity search. Since you're collapsing millions of reads into assembled contigs, when you actually run your functional annotation search, you're only comparing the collapsed contigs to your database. So instead of millions of sequences, you could have maybe thousands: a hundred- to a thousand-fold fewer. The easiest way to see it: imagine a pile of 100% identical reads; you wouldn't want to run that search over and over again. You could collapse them first and do the similarity search once. So that's a benefit. If you have short reads, and this is a bit more worrisome with HiSeq-type data at, say, 75 base pairs, assembling the reads into longer contigs can increase your ability to assign annotations, because short reads limit your ability to find a good match to a protein. The other cool thing is that you can sometimes reconstruct genomes.
That works somewhat if you just assemble the whole thing, but some people will do some binning first and then assemble each bin, with the hope of pulling out at least partial genomes from the metagenome. OK, so on the con side: although you save time on the similarity search, assembly of full metagenomes is really computationally intensive. You need a pretty big machine with a lot of memory, and those aren't available to all people in all labs. Unless you have access to a machine with at least 128, and probably more like 256 or 512, gigabytes of memory, assembling a full metagenome is probably out of reach. But if you know a buddy with one, then you're good. You also have to keep track of the collapsed reads. This is kind of a subtle point, but for a while people were doing assembly and then just throwing away the fact that different numbers of reads had been collapsed into the different assembled pieces. Now some assemblers will keep track of the reads behind each piece of the assembly, or you have to do it yourself. You can imagine that if you collapse all these reads into an assembly and then annotate it, you'd want to keep track of how many reads originally mapped to each contig. Some assemblers do that, but otherwise you have to map the reads back yourself and keep track of those counts. Low read depth and high diversity can cause assemblers to fail. Again, you have the problem of shallow sequencing across thousands of different organisms, so the chance of the assembly simply failing or not working very well is high. And what else? Chimeras. The hope is that the assembler does a pretty stringent job of only joining things that come from the same genome, but there's a chance it could assemble sequences from different strains, or even different species, if it decides the overlap is good enough.
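The read-tracking point above boils down to a simple bookkeeping step: weight each contig's annotation by the number of reads that went into it. The read-to-contig table here is invented (in practice it comes from mapping reads back with an aligner), and the KO assignments are hypothetical:

```python
# Why you must track collapsed reads: after assembly, per-contig annotations
# should be counted per read, or abundant organisms get flattened to one count.

read_to_contig = {  # invented mapping; in practice produced by a read aligner
    "read1": "contig_1", "read2": "contig_1", "read3": "contig_1",
    "read4": "contig_2", "read5": "contig_2",
}
contig_annotation = {"contig_1": "K00929", "contig_2": "K01803"}  # hypothetical KOs

ko_counts = {}
for read, contig in read_to_contig.items():
    ko = contig_annotation[contig]
    ko_counts[ko] = ko_counts.get(ko, 0) + 1  # count reads, not contigs

print(ko_counts)  # {'K00929': 3, 'K01803': 2}
```

If you counted contigs instead of reads, both functions would score 1 here, and the three-fold difference in sampling depth between them would be lost.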
And so you could actually get these chimeras — sequences that don't all come from the exact same genome — which could cause problems downstream. Okay, and lastly, there's a chance for this, although I still haven't seen any study showing whether it's a big problem or not: if you do assembly on whole metagenomes, you can imagine that you'd assemble the more dominant things in the sample, while the rarer things would not assemble as well. And that could change your functional annotation — it could bias your ability to label functions in the rarer organisms, because those remain as short reads. I don't have any proof that that's actually a problem, but it's something to keep in mind. What about genome complexity — repeat regions and stuff like that? Right, hard to assemble. Yeah, absolutely. And then that would affect your functional annotation as well, right? You would miss things. Ribosomal operons, for example, are often really hard to assemble or annotate — there are usually multiple operons, so they get left out of the assembly, and then all of a sudden you can't annotate those rRNA genes. So it's just something to be cautious of, but assembly is still used quite often by a lot of metagenome pipelines. As far as I know, JGI still puts all of their metagenomes through an assembly process to get contigs before moving on with their pipeline. My preference right now is to just work on the raw reads — it removes some bias that way. Any thoughts on that? Any thoughts from the speakers about assembly versus not? John, you're an assembler, aren't you? Okay, do you guys assemble your data? No one? No, I do — well, there are two things: our data just didn't assemble very well, and the other thing is that I'm also worried about that last point, right?
Especially if you're trying to do a comparison between two environments — one with organisms that are simpler, and one with organisms that have larger genomes and more variety — one is going to assemble much better than the other, and then that difference could get attributed to real biology when it's really an assembly artifact. Right. I guess it comes down to how well the assemblers actually perform across different kinds of data — and the newer metagenome-aware assemblers are much better than the older ones. There are good metagenome assemblers out there now that are designed to get over some of these issues. I've used a few different assemblers myself; it's worth trying more than one. For viral data, we assemble because we have short reads — it was HiSeq. Right, and in that case you just have to accept that your assembly will be biased, and do your analysis knowing ahead of time that the assembly might bias the results. As long as you keep your biases in mind during the analysis — for the viral data, you had to assemble. Yeah. And I want to come back to the pro side: I think there are useful things to do with assembly in metagenomics, but if you're really only interested in getting to the taxonomy table and the functional table by themselves, I think you can do that with raw reads, and that's what the tutorial shows you. But there are other times where you might really want assembly, right? This idea of binning reads and then reconstructing genomes, or anything where you actually need the longer sequences — like if you're trying to find LGT or something — that really requires an assembly. But you have to remember that assemblers work great on single genomes, where you know the pieces are supposed to fit together most of the time.
I mean, assemblers were built to handle chromosomes, right? They weren't originally designed for metagenomes, where you have hundreds or thousands of different genomes. So you just have to keep that in mind — and these things can be really similar, so if you have a population with different strains, you could get these chimeric assemblies, which kind of scares me a bit. Okay, so what about gene calling? That's the next step. In genomics, normally you would take your reads, do assembly, and then do gene calling. If you're not familiar with it, gene calling is just the idea that you're going to try to predict the start and stop sites of an actual gene within your genome. You can imagine an open reading frame as a run of nucleotides that codes for amino acids; gene callers look for open reading frames over a certain size without an internal stop codon, and then use gene models from other sequenced genomes to estimate the probability of that open reading frame being a real gene. So for metagenomics, some people might want to run a gene caller before actually assigning functional annotation. The pro of that approach is that it may result in fewer false positives from annotating non-real genes. You can imagine you have a DNA sequence and you find a match to some open reading frame, but it's mostly just a domain shared with another gene, and it might not even have been a real gene in the first place — especially if it was a shorter hit. Whereas if you run a gene caller first, maybe you'd have fewer false positives that way. Again, it also lowers the number of similarity searches, so it cuts down on computational time on that end because you have fewer things to functionally annotate. And again, similar to assembly, on the cons side the gene callers are quite computationally intensive by themselves, right? It takes a long time.
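The open-reading-frame part of what a gene caller does can be sketched very roughly. This toy version only scans the forward strand for ATG-to-stop spans over a minimum length, and skips the gene-model scoring that real callers layer on top:

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=30):
    """Simplified ORF scan, forward strand only: report (start, end)
    spans that begin at ATG and run to the next in-frame stop codon,
    keeping only ORFs over a minimum length. Real gene callers also
    score candidates against trained gene models."""
    orfs = []
    for frame in range(3):          # the three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i           # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None        # close it, keep scanning
    return orfs
```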
The next con is that there's no good training dataset. Usually with genomes, people build gene models based on well-known model organisms and train on those. With metagenomics, all of a sudden you have things people have never seen before — genes you've never seen, organisms you've never seen — so there are no great gene models, and it's really hard to come up with a nice training dataset to build them. The other big thing is that if you're looking at raw reads, a gene fragment might be missing its start, might be missing its stop, or might be missing both — you might just have a fragment from the middle. That causes problems for gene callers. Because of that, people would often do assembly first — the people that tend to assemble often also use a gene caller. And the gene callers out there that are built specifically for metagenomics will allow for, say, the start or stop to be missing, so they don't have that restriction. I've mentioned MetaGeneAnnotator here, as well as FragGeneScan, and I think there have been others as well. The alternative is to not do gene calling at all and just do your six-frame translation using BLASTX or DIAMOND — and that's the approach I take. Yeah? This is mostly for bacteria, yes — would it work for eukaryotes? The gene calling? No, I wouldn't say it's only for bacteria — actually, I don't know; I think it would work. Does anyone have any thoughts about that? There are gene callers for eukaryotes, obviously; I just don't know if they've been adapted for metagenomes. Interesting — is there a specific organism you're thinking of? Well, I guess the issue might be how much intron content there is diluting the gene signal.
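The "six-frame" part is just this: a read can encode protein in three frames on each strand, so translated search tools consider all six. A minimal sketch of enumerating the frames — the actual translation to amino acids and the database search are left to the tool:

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def six_frames(seq):
    """The six reading frames of a read: three offsets on the forward
    strand and three on the reverse complement. Tools like BLASTX and
    DIAMOND translate each frame to protein internally and search all
    six against the protein database."""
    rc = reverse_complement(seq)
    return [seq[i:] for i in range(3)] + [rc[i:] for i in range(3)]

frames = six_frames("ATGC")   # toy 4-mer just to show the shapes
```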
Yeah, I think that might be the better approach: you filter your eukaryotic reads first, and then say, these are the eukaryotic ones, so we're going to do something different with them. You can still carry out assembly and then, at the gene-calling level, treat the eukaryotic material differently. Yeah, it's a good point — if you were going after eukaryotic assembly, then doing gene calling that way might be the better approach for eukaryotes. We work with some samples that have very few introns, but nevertheless do have them, so we have to treat them separately and use gene finders that are more specific to eukaryotes for the ones that are really interesting. Right. Anyway, good point. So how does all of this actually show up in the data? If you assemble or not, what difference does it make in the end? Really important question. So the classic example is that you get to the point where you have a table of functions, and then people group the functions into their different categories — similar to what we got with PICRUSt, a KEGG ortholog table with multiple samples. And then people do statistics and say, oh, that's interesting, we see the microbiome has a greater proportion of these particular carbohydrate-metabolizing genes, right? The classic one is something like butyrate. As humans, we know that butyrate is kind of important because of its link to colorectal cancer and its role in the intestine, so it's seen as a healthy aspect. And so you can look for genes that are involved in butyrate production — butyrate kinase or something.
And then people will show that maybe the microbiome has more or less capacity to make, say, butyrate, which has been linked to other diseases. So that's interesting. But it could be other functions too, ones people have no idea about — similar to the PICRUSt data. So you start to draw conclusions about the data based on what the microbiome as a community is doing. That's what I would call the most straightforward analysis, and that's what you're going to be doing in the tutorial. Then there are other things people do, maybe with genome reconstructions: maybe they're trying to describe a new genome that's never been described before, so they attempt metagenome reconstruction, or they're trying to link functions back to specific organisms through the assembly, something like that. There's all kinds of cool stuff you can do with the sequences beyond just making the functional table, but that's what I'm focusing on here. Does that make sense? Okay, cool. Okay, so this is my pet peeve number two — kind of like pet peeve number one this morning. Just so when you're reading papers: if we're doing a metagenome, we're not measuring actual transcripts or proteins, right? We're measuring genes. So the big thing is that we're talking about functional potential. Just because we see an increase in the relative abundance of certain genes within the microbiome doesn't necessarily mean those genes are being actively transcribed, or then translated into proteins that are having an actual impact in the gut. That being said — and it would be interesting to hear opinions on this — I think people tend to assume that the community is under pretty strong selection, and that if you do see an increase or decrease in the presence of certain genes, you can assume those genes are actually being transcribed.
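The "group into categories, then do statistics on proportions" step can be sketched like this. The KO-to-pathway mapping here is a tiny made-up example for illustration, not the real KEGG hierarchy (though K00929 really is the butyrate kinase ortholog mentioned earlier):

```python
def pathway_proportions(ko_counts, ko_to_pathway):
    """Roll per-sample KO counts up into pathway-level relative
    abundances -- the kind of table you'd then compare across sample
    groups in STAMP. KOs without a mapping fall into UNMAPPED."""
    totals = {}
    for ko, count in ko_counts.items():
        pathway = ko_to_pathway.get(ko, "UNMAPPED")
        totals[pathway] = totals.get(pathway, 0) + count
    grand = sum(totals.values())
    return {p: c / grand for p, c in totals.items()}

# illustrative two-KO sample; counts are invented
ko_counts = {"K00929": 30, "K00001": 70}
ko_to_pathway = {"K00929": "Butanoate metabolism",
                 "K00001": "Glycolysis"}
props = pathway_proportions(ko_counts, ko_to_pathway)
```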
No — I don't know, that's not quite the right way to put it. Rather, that they could be useful, and that it does show potential for a gain or loss in function. But you're not actually measuring the transcripts without doing metatranscriptomics, right? You just have to be aware that you're not measuring, say, butyrate levels in any way. You're just saying you see an increase in the ability of the microbiome to generate butyrate, and you would have to measure the metabolites separately, or the transcripts separately. Does that make sense? Okay, good stuff. All right, so that's my summary — I have a quick summary slide. Remember I showed this before, but I think this is where we're at now over the two days. The hope is that you can generate OTU tables from 16S data with QIIME. This morning, I showed you how you take metagenomic data and use MetaPhlAn to get the same sort of taxonomic tables — so it's not OTUs, but it's the same kind of data, right, with things grouped into a species or a genus or a family. And then from there, you can use STAMP to generate these types of plots. Yesterday, I showed you how to get from the 16S OTU table, using PICRUSt, to the functional table. And now we're going to show you how to go from shotgun metagenomics data, using HUMAnN and DIAMOND, to the KEGG ortholog table. And then you're going to use STAMP to check out some of the functional differences. Does that make sense so far? Okay, any questions? Those are not STAMP's output, I think. Sorry, what was that? Those aren't really STAMP's output, I think. No, that's a bit of a lie — this one is, but that one isn't; that's QIIME. I made an error there, that's QIIME. This is a hand-waving STAMP idea, not literally STAMP output, sorry — STAMP slash other tools I should put in here. Why STAMP over QIIME? Well, they do quite different things — for visualization, you mean?
Well, they just give you different things — that's not a very good answer. QIIME does a great job with UniFrac and the PCoA plots, you know, with Emperor and all that; it's pretty slick. But if I'm looking at individual taxa and individual functions, I like STAMP for that — you can click through it and it draws the plot for you, whereas with QIIME you can do that as well, but it's a little more clunky; you have to make each plot from the command line. You can make heatmaps in STAMP, and I don't think you can make heatmaps in QIIME, as far as I remember. Really big heatmaps I would prefer to do in R, because that's just what I like. Stacked bar charts I often do in QIIME when you're looking at multiple samples, but you can do those other ways too. So I think it's just what you're used to, yeah. Okay, so for the lab this afternoon, like I said already, it's the same starting dataset: this time you're going to do a DIAMOND similarity search, look at that output, then run HUMAnN to get your pathways and modules, and load that up into STAMP. There is a step where, for all the samples — we mentioned that you don't actually have to run all the samples through DIAMOND, because it would take too long, but you can give it a whirl. Actually, I think I updated how long it takes — I think it's only about 10 to 15 minutes. So if you want to feel satisfied, you can just run it instead of using the pre-computed results, or maybe run it while you're doing something else with the pre-computed results — you'll see how it gives you your output. So I'd encourage that. The Amazon cluster has eight threads, right? Eight threads and quite a bit of memory, so you can parallelize things up to eight ways and really punch through the data. It's kind of satisfying, too.
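The eight-thread point is just data parallelism: split the reads into chunks and process the chunks concurrently, which is the kind of thing DIAMOND does internally when you pass its threads option (e.g. `--threads 8`). A toy stand-in — the per-read "work" here is a trivial GC calculation, not a real similarity search:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(reads, n_chunks):
    """Split the reads into roughly equal chunks, one per worker."""
    size = -(-len(reads) // n_chunks)  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]

def gc_content(read):
    """Toy per-read 'work' standing in for the similarity search."""
    return (read.count("G") + read.count("C")) / len(read)

def parallel_gc(reads, n_threads=8):
    """Fan the chunks out over n_threads and flatten the results,
    preserving the original read order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda c: [gc_content(r) for r in c],
                         chunk(reads, n_threads))
    return [x for part in parts for x in part]

values = parallel_gc(["GGCC", "ATAT", "GATC", "GGGG"], n_threads=2)
```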
I don't know why it's so satisfying to me. I bought this machine that has 48 threads, and I just love loading it up with data. I can't even see the machine, but I picture it sitting there, just going. Anyway, maybe it's just me. Are there any other questions or comments? Oh, there is one thing I want to show you. So that's the end of the lecture, okay?