I think we're good for day two. Everyone feel rested from yesterday? I'm feeling good. Yesterday was just the warm-up; today's the real challenge. No, no, today should be good. It's actually a little less busy: we've got a nice solid module in the morning, a nice solid module in the afternoon, no student talks. We're going to get down to work today. So yesterday went fairly smoothly, and I think today should go well too. I tried to cut down some of my slides because we had some pretty good questions yesterday, and a lot of people here have come to genomics from different angles. So I'd encourage you to ask questions again, just like yesterday; I think that'll really be a benefit for everyone. Okay. So with that, we can get into things. We're going to be talking about metagenomics, and the learning objectives are fairly broad here. Basically, I'm going to start by contrasting metagenomics with amplicon sequencing. I think we have a pretty good handle on that, but it's always good to revisit. Then we're going to describe general approaches for taxonomic assignment. This is very similar to what we did yesterday with 16S, trying to assign taxonomy, but obviously metagenomics is different, so there are different challenges and different approaches. And then today will be our first jump into microbial functions: not just looking at taxa, but at what we mean by annotating microbial functions, and the different approaches for that as well. Now, if you thought there were a lot of options yesterday for 16S, today everything is going to be "well, you could do this, or you could do that." 16S actually has a pretty solid workflow, I would say, with some variation in there.
Metagenomics is almost pure freedom, in that you can do a lot of different things, and a lot of cool different things. I can't possibly cover them all in this lecture, but I'm happy to talk through questions and during breaks. Okay. So yesterday with 16S, we saw that it's well established and relatively inexpensive. We talked about that: depending on where you get your sequencing, it's maybe $20-40 a sample, and that's because you don't need a lot of sequencing reads, maybe 50,000 to 100,000, perhaps more depending on what you're doing. And we talked about how 16S, or other amplicon sequencing, only amplifies what you want; depending on the primers you choose, you're selecting what you're going to get back. Metagenomics, on the other hand, is really the idea of just sequencing all the DNA in a sample. We can talk about ways to enrich for certain organisms, but at the heart of it, there are no specific primers going after a certain gene or target; you just extract all the DNA and sequence it. So that's great: we don't have to worry about what our target is, and in theory there's no primer bias. The other big benefit is that, whereas before we had to pick which microbes we were going after, in theory we now get a whole picture of all the microbes in that environment. We can study viruses, we can study microbial eukaryotes if they're there, and we can study bacteria and archaea. The other big thing is that it's widely accepted you get better taxonomic resolution. We talked about 16S getting down to maybe the genus level; with metagenomics we can often get down to the species level and maybe even some strain identification.
That depends, of course, on lots of factors, like the depth of sequencing and how dominant that organism is in the sample, but you usually get better taxonomic resolution. The downside, of course, is that it's more expensive: we're going to sequence a lot more. For a typical metagenomic experiment, you're talking about five to ten million reads, and it's not uncommon to hear of people sequencing maybe 100 million reads, depending on the platform and the environment you're looking at. Because of that, you incur the cost of the extra sequencing, so now we're talking about a price range of maybe $100 to $200 a sample, depending on what you're doing. The other big benefit is that besides identifying who is there, it also identifies microbial functions. So you can look at what those microbes are doing, which is really a different lens on it. And then the other really big thing is that you can reconstruct genomes; that's the topic for this afternoon, how we can try to put the pieces back together and assemble genomes. There are probably other advantages of metagenomics beyond these, but those are our main contrasting points. So I thought I'd lay out some of the broad approaches you might think about using metagenomic data for. One is the process where we use reference databases and map our reads to those databases to come up with tables of either taxonomic or functional information. This is similar to what we saw yesterday, and in one form or another it's what we do in a lot of cases: we have our reads, we do filtering (we'll talk about that), and we have some database.
You use some sort of mapping software, and you end up with new tables of, say, taxa or particular functions, and then you do downstream analysis on those tables: some of the same approaches you learned yesterday, like alpha and beta diversity, but you can also do more interesting things like networks and co-occurrence. So that's one broad approach, and it's primarily what we're going to focus on this morning. There are other approaches out there. If you're really interested in specific genomes (I know there are a few people here studying particular pathogens and whether they're present or not) there are approaches for that as well, where you have a genome of interest, you're really focused on it, and you map your metagenomic data to it to find out, one, is it really there, and two, are there differences between the genome I have and the one in the sample? This is what's called fragment recruitment. MAGs probably dominate over this approach now, and I don't see it used as much, but it's still quite useful: if you have good coverage of your genome, it gives you an idea of whether there are parts missing, suggesting the genome in the sample is different, or whether coverage is just sparse on that genome and it is or is not there. And then the topic I alluded to, which we're not going to talk about this morning but this afternoon, is the idea of reconstructing genomes. With this, what we're really trying to do is take raw reads, assemble them into contigs, bin them, hopefully do some quality assurance, and get back, essentially, to where we started with the original genome.
And of course the advantage there is that we're not culturing; we're literally just taking samples, very high throughput. It's noisy and problematic, but it's very attractive in that we're skipping all that culturing work. Laura will talk about that this afternoon. So let's jump right in. I'm going to split the talk into taxonomic profiling first, and then functional profiling after that. Okay. For taxonomic profiling, we're trying to take this raw data and get to a table like yesterday's: taxa by samples. There are many challenges in identifying taxa from metagenomic data. One, obviously, is that the reads are randomly sampled from the genomes, and the reads are usually short. It depends, you may have long-read data, but usually they're short, say 100 to 150 base pairs. So you can imagine: if we're not assembling, there's limited information to say this short piece of DNA came from this organism. We might also have spotty genome coverage due to sequencing depth. Even though we sequence quite a bit, it's unlikely we'll sequence all the organisms in the sample to an adequate depth to get good coverage across all the genomes. For the most dominant organisms, sure, we might have pretty good genome coverage, but there's going to be a long tail of rarer things where we won't have great coverage. So for some organisms we'll only have tens or maybe hundreds of reads against that genome, and that's the information we're relying on. Again, that doesn't go for all cases: you might have a very well-defined system with only a few taxa, but in most cases we have hundreds to thousands of taxa.
Another issue is lateral gene transfer. It isn't really talked about a lot, but it's obviously a big problem: if you have a short piece of DNA and you're trying to annotate it as coming from a certain genome, but that piece of DNA was recently horizontally acquired from another genome, that's misleading. Essentially it looks like it's from one genome, but it was actually transferred into a different one. So horizontally transferred DNA isn't great as a marker or indicator that a taxon is there. And then the biggest thing, really, is probably computational time. Unlike 16S data, which is pretty manageable (I know we skipped over a couple of steps yesterday, but the reality is you could probably install QIIME 2 on most of your laptops and process a 16S dataset in a reasonable amount of time on a reasonable computer) metagenomic data is much larger. There are a lot of reads, and the databases are very large too. That puts real restrictions on the approaches we have to take. The computational methods will use heuristics and approximations to make them faster, but of course they're not as accurate that way. It also means that for many tools you probably won't be able to run them on your laptop. There are some tools you could (we'll talk about the differences later) but you might have to go to something a bit beefier: access to a cluster somewhere, whether that's an AWS instance you're paying for, or Compute Canada, which is now called the Digital Research Alliance of Canada (they offer computational clusters for free) or someone you know with high-performance computing. We'll talk about that a little later too. Okay.
So on the taxonomic side, let's talk about some of the initial bioinformatic processing steps. I'm not going to repeat things here: much of the same filtering we talked about yesterday with 16S data applies to metagenomic data. Usually the first thing you do is get rid of reads that just don't look high quality. You might as well remove them early; it speeds things up downstream, and they wouldn't make it through the other end anyway. So you apply different filtering techniques to the data, and you also demultiplex and do some lane merging. I talked about stitching reads yesterday, but usually with metagenomic data we don't join reads; they usually don't overlap. Because of that, if you're using Illumina data, in most or many applications you simply take your forward reads and your reverse reads and combine them into one file. Some tools will take the forward and reverse reads and know what to do with them, but at some point in your pipeline you'll probably end up merging them together, okay? We can talk about that more later, but the only tools I know of that really use the pairing are actual DNA mappers like BWA or Bowtie2; they know how to handle forward and reverse reads, while a lot of other similarity-search tools don't. Okay, so what's unique with metagenomic data is this idea of removing unwanted, what I call host-associated, reads. You can think of this as contamination, though people usually mean other things by contamination. Essentially, in many studies (not all, again) many of your reads will be associated with some sort of host.
If it's human, that would be us; but you could be studying bats, or a plant-microbiome interface, anything where there's a lot of DNA coming from a source you really don't want in your analysis, and then you might want to consider removing that DNA. For something like human stool, no problem at all: human stool is mostly bacterial DNA, so you can do a filtering step without losing too much. But say we're talking about saliva or the skin microbiome, or a plant-associated microbiome where you're taking parts of the leaf: those eukaryotic cells have a lot of DNA in them, which means they get sequenced quite a bit, and suddenly it's not uncommon to see a large portion of your sample be host-associated reads. For the oral microbiome, we're talking 50 to 80% of the DNA you sequence being host-associated, being human-associated. That's a limiting factor of metagenomics; it's probably one of the only real downsides besides cost, or rather, it's an added cost if you have a lot of host DNA. It sounds great to do metagenomics, but if 80-90% of the DNA is host and you have to throw it away, then in theory you have to sequence a lot more to get enough microbial reads. Okay, so doing the removal is fairly straightforward, especially if we have the host genome. For humans we have human genomes; for mice we have mouse genomes; for some plants we have their genome. If you don't have it, if you're studying something more exotic (I don't know what's exotic nowadays; dolphins? is there a dolphin genome? the problem is, I don't know) then that's problematic, and you'll probably want to take the nearest phylogenetic neighbor and try to map against that. And then usually you just use what we call a DNA mapper.
These are pretty common: Bowtie2, BWA. It's fairly straightforward, and usually you want the parameters to be fairly lenient, because any read that really tends to map to that host genome, we just want to take out. I'll also mention PhiX, which is often used as a sequencing control. It should be removed by the Illumina sequencer, but sometimes we still see a little left over, so we usually include it in our database of things to screen against. Okay, does the screening make sense? Pretty straightforward. Yes, John: not straightforward. Yeah, so I don't know if everyone heard the question, but it's about ethics, and the fact that if you do human microbiome sequencing, you're going to get a lot of reads from the human genome, and that sometimes causes a problem with ethics boards, and just thinking about how to handle that situation. It's true. Actually, we had a study a while back with biopsies of intestinal samples for IBD, where instead of just throwing away the roughly 90% of reads that mapped to the human genome, we actually used them to genotype the individuals. We came up with known risk factors for IBD and combined that with the microbiome data, so there's a lot of strength there. And then recently there was a paper; my brain's not functioning yet, I'm drawing a blank on the details, but it was about a privacy concern. They showed that you could use the human reads in microbiome sequencing to identify the individual. So from an ethics standpoint, it depends on who your ethics board is. In my experience, ethics boards tend to think the microbiome is cool and don't care much about it.
But as soon as you say you're sequencing the human genome, ethics boards have a lot of rules for that. That's my experience. For most people, the way to handle that (I should never say "get around") is to clearly state in your ethics application that you're going to remove those human reads and possibly discard them, and that if you upload the data, you'll be uploading raw data that is only microbial, in the hope that all the human sequences were filtered out. That's how I've seen almost all of these cases handled, and most ethics boards seem pretty happy with that approach. Yeah. Question: about mitigating barcode hopping? Yes. So this idea of barcode hopping is interesting. As we talked about with barcoding samples and multiplexing, it's been shown that sometimes barcodes can hop: a sample is barcoded, but because of how the barcodes get attached, they hop to other samples when they're combined. It's an interesting problem, and I would say most people don't worry about it, because the level it occurs at is certainly low. It can still be problematic, though. If you have very different types of samples, you might be able to identify those hopped reads; say you're multiplexing human versus ocean or some other water samples, you could maybe pull those out. But there isn't a great general solution, even though we do know how to identify those cases. So it's probably lower on my list of concerns, I guess, is my thought on it. And I believe it was fairly focused on MiSeqs; I don't know if they see it as much on the newer NextSeqs and NovaSeqs. But yeah. Okay.
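Just to make the host-screening step concrete: in practice you'd run a real DNA mapper like Bowtie2 or BWA against the host genome with lenient settings and keep the unmapped reads, but the throw-it-out-if-in-doubt logic can be sketched in a few lines of Python. Everything here (the toy "genome", the reads, the k-mer cutoff) is made up purely for illustration; it is not how the real tools work internally.

```python
# Toy sketch of host-read screening. Real pipelines map reads to the
# host genome with Bowtie2/BWA; here we fake it with shared k-mers.

def kmers(seq, k=21):
    """All overlapping k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def screen_reads(reads, host_genome, k=21, max_shared_frac=0.5):
    """Keep reads sharing fewer than max_shared_frac of their k-mers
    with the host genome (lenient: if in doubt, throw the read out)."""
    host_index = kmers(host_genome, k)
    kept = []
    for read in reads:
        rk = kmers(read, k)
        shared = len(rk & host_index) / max(len(rk), 1)
        if shared < max_shared_frac:
            kept.append(read)
    return kept

host = "ATGCGTACGTTAGCCGATTACGGATCC"   # stand-in "host genome"
reads = [host[2:22],                   # a host-derived read
         "TTTTTCCCCCGGGGGAAAAA"]       # a "microbial" read
kept = screen_reads(reads, host, k=5)
print(kept)  # only the microbial read survives the screen
```

The real screen is the same decision per read, just with an aligner doing the matching against the full human (or mouse, or nearest-neighbor) genome.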
So, moving on to taxonomic assignment: we're going to talk about reference-based approaches, where you take reads and you want to get to taxonomic assignments. First is the all-reads approach. The idea is that you take all your reads and, for every single read, you try to map it to a genome and assign taxonomy. You won't succeed for every read, but that's the attempt. Okay? This is tools like Kraken and Centrifuge; we're going to talk about those. That's difficult because of some of the same problems I mentioned earlier: LGT is a problem here; repetitive DNA obviously won't map well; and you'll have a lot of reads that just don't hit the database because they have no homologs there, so you can't assign taxonomy to those. Okay? But there's another way to get at taxonomic composition within a metagenomic sample, and that's a marker-based approach. Instead of trying to annotate all of your DNA, you use only some of the reads to figure out the composition within a sample. For that, you have to define what those markers are, and the result depends on that choice of markers. That sounds a little abstract, so let me pin down what I mean. For marker-based approaches there are a few different angles. One is a single-gene approach: you could actually extract, say, 16S reads or cpn60 or other universal genes from your data, process them through QIIME 2, and define taxonomy that way. Now, that's not really a great use of all your sequencing; you just spent a lot of money on metagenomic sequencing, so you don't see that very often. What you see a lot of instead is using multiple different markers.
One approach is the mOTUs approach, which uses ten universal single-copy genes and determines taxonomy that way. The other approach (I'm not sure why the font sizes are all over the place here) is MetaPhlAn, which is probably the most popular, and it uses this idea of clade-specific markers. Okay. The thing to get into your head is that MetaPhlAn is not trying to annotate all the reads; it uses specific markers to figure out the taxonomy within your sample. So I'll talk about MetaPhlAn a bit more. If you've never heard of it, it comes out of the Huttenhower lab, and it's on its fourth iteration at this point. It's very popular, especially, I would say, in human microbiome studies, but you do see it applied in other environments as well. I have some notes here; it's a lot of text, I'm sorry, but I mostly extracted it from the most recent paper to get a handle on how it works. They start by making a gene catalog: they combine, they state, one million bacterial and archaeal genomes. About 25% of those genomes come from traditional isolate culturing or single-cell sequencing, and 75% are actually derived from metagenome-assembled genomes. Okay, so (as you'll see later, I take metagenome-assembled genomes with a grain of salt) they start with all those genomes. They then group the genomes into bins you can think of as species; they call them species-level genome bins, or SGBs. Instead of worrying about taxonomy and all of that, they basically use a 5% genomic-identity cutoff, meaning that if genomes are within 5% of each other across the whole genome, they collapse them into one of these bins. They end up with about 22,000 known SGBs, meaning ones where they know the taxonomy.
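That 5% identity binning can be sketched as a greedy clustering. The toy "genomes" and the simple per-position identity function below are stand-ins; the real pipeline computes whole-genome average nucleotide identity over a million genomes, which is a much bigger job.

```python
# Toy sketch of grouping genomes into species-level bins by a 5%
# whole-genome distance cutoff. identity() is a crude stand-in for a
# real whole-genome ANI calculation.

def identity(a, b):
    """Fraction of matching positions between equal-length toy genomes."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def bin_genomes(genomes, cutoff=0.95):
    """Greedy clustering: a genome joins the first bin whose
    representative it matches at >= cutoff identity."""
    bins = []  # each bin: (representative_sequence, [member_names])
    for name, seq in genomes.items():
        for rep_seq, members in bins:
            if identity(seq, rep_seq) >= cutoff:
                members.append(name)
                break
        else:  # no bin was close enough: start a new one
            bins.append((seq, [name]))
    return [members for _, members in bins]

genomes = {
    "genomeA": "ACGTACGTACGTACGTACGT",
    "genomeB": "ACGTACGTACGAACGTACGT",  # 1 mismatch vs A -> 95% identity
    "genomeC": "TTTTACGTACGTACGTGGGG",  # far from A -> its own bin
}
print(bin_genomes(genomes))  # A and B collapse into one bin; C stands alone
```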
And then they have about 5,000 unknown SGBs where they don't have a taxonomy label; they just give them an identifier. Okay, so from there they extract the markers. What they're looking for are particular genes (markers; we'll just use the term genes because it's easier to think about) that are unique to a specific SGB and always found within that SGB. Okay, so you have a group of genomes at the species level, and they essentially look for any genes that are always found in that group and are also unique to that species, and they use those markers as the approach to then call your data. This is all done ahead of time by the developers, and for every SGB they have somewhere from around 10 to 200 different genes they can look for in your data to say that SGB is there. So that's all done in the background. Essentially, you come in with your data, your metagenomic reads are searched against their list of markers using Bowtie2, which is really fast, and then they do a bunch of filtering and normalization to produce your taxonomic profiles. Does that make sense for everybody? Yeah. Okay. Lastly, it is limited to bacteria and archaea. I think back in version two or three they dabbled with viruses for a while, but they seem to have given up on that approach. So if you're interested in phage or viruses, this doesn't help you; if you're interested in microbial eukaryotes, this doesn't help you. Just understand that about the approach. Okay, so we'll come back to MetaPhlAn in a little bit and compare it to this other approach. But the other major approach is the all-reads approach, where, again, we're searching all reads against a large database. There are quite a few tools out there; the most well-known one is probably Kraken and its partner Bracken.
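Before getting into the all-reads tools, here's a quick sketch of the marker-selection idea described above: keep genes that are core to a species-level bin (found in every genome of the bin) and unique to it (found in no other bin). The bin and gene names are made up for illustration; MetaPhlAn does this over its full million-genome catalog.

```python
# Toy sketch of MetaPhlAn-style marker selection: markers must be
# core to their species-level bin AND absent from every other bin.

def select_markers(bins):
    """bins: {bin_name: [set_of_genes_in_genome_1, set_of_genes_in_genome_2, ...]}"""
    markers = {}
    for name, genomes in bins.items():
        core = set.intersection(*genomes)          # present in every genome of the bin
        others = set().union(*(g for other, gs in bins.items()
                               if other != name for g in gs))
        markers[name] = core - others              # core AND unique to this bin
    return markers

bins = {
    "SGB_A": [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}],
    "SGB_B": [{"g2", "g5"}, {"g2", "g5", "g6"}],
}
print(select_markers(bins))  # g2 is shared, so only g1 and g5 qualify as markers
```

At profiling time, finding enough of a bin's markers in your reads is the evidence that the bin is present.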
You might have also heard of Centrifuge or Kaiju, and new tools are still coming out to this day. All of these are essentially sequence-similarity matching tools, and they use different computational approaches to be faster or more sensitive. In most cases they use a k-mer-based searching strategy, and I'll briefly cover that so you have a foundation for what that means. They also usually use some sort of lowest-common-ancestor approach for assigning taxonomy, and I'll describe that as well. For k-mer-based approaches, the idea is that you have your database of genomes, you take one of your genomes, and you split it up into overlapping fragments of a fixed length. Okay. In this particular case, you split it into pieces of length five, so these would be 5-mers. "k-mer" just represents the fact that you're splitting into pieces of a certain length, and when people develop the tools they'll explore different k-mer lengths to see which is most optimal. So they divide the genome into these k-mers; and wow, I really skipped over a whole bunch there. It seems like this produces a lot more data, but from a computational standpoint it actually allows you to search quickly. Say you're using a k-mer of five: when they search for a k-mer, they look for exactly that five. If you use a longer k-mer of, say, 11, you'd have to have all 11 nucleotides exactly the same. And then, yep: to verify, this is all done on the database of genomes that you have, not on your reads? That's a really good point; I should have clarified. This is all on the big database of genomes you're starting with; you're optimizing how you represent that data before you do the comparison. It's all done ahead of time when building the database. Good question.
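The k-mer splitting above is simple enough to show directly; the sequences here are arbitrary examples. The point is that once the database genomes are decomposed into k-mers and indexed, a query is an exact-match lookup rather than an alignment.

```python
def split_kmers(seq, k=5):
    """All overlapping k-mers of a sequence (the 5-mer case above)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(split_kmers("ATGCGTAC"))
# ['ATGCG', 'TGCGT', 'GCGTA', 'CGTAC']

# Database side, built ahead of time: reference k-mers go into an index.
index = set(split_kmers("ATGCGTACGGATCC"))
print("GCGTA" in index)  # exact k-mer membership test -> hit
print("GCGTT" in index)  # one mismatched base -> no hit at all
```

That last line is the trade-off in miniature: unlike BLAST, a single mismatch kills the whole k-mer, which is exactly why the tools demand multiple k-mer matches per read before calling a hit.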
Anyway, the biggest thing to understand about k-mers is that they allow very fast matching. Okay, it's way faster than something like BLAST, which allows missing nucleotides, insertions, and deletions; k-mer search looks for exact matches, which can be sped up using interesting computational approaches, is how I'll describe it. Okay. And the reason I'm talking about this is that most of these tools use a k-mer-based strategy, and that's how they're really different from something like BLAST or other approaches. Okay. Now, I'm not going to get into the details, but they then basically require multiple k-mer matches to the sequence you give them before they call it a good hit to a particular genome. Once you pass that threshold of enough k-mers and define a hit, you're left with assigning taxonomy to it. And this is just showing the basic idea of a lowest-common-ancestor approach. What it's showing is different taxa; this could be a phylogenetic tree, but usually it's just a taxonomy tree, so we can group things at the genus level here, or at the family level up here. Now, because you have your short 150-base-pair read, or whatever length it is, you're going to get multiple hits to different genomes, right? And depending on how you threshold that, you might find you have hits to one genome and hits to another. So what do I call this read? Do I call it E. coli? Do I call it Salmonella enterica? Do I call it Salmonella, or just Escherichia? A lowest-common-ancestor approach just says: I will take the lowest point in the taxonomy tree that covers all the things I've said were significant hits.
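The lowest-common-ancestor rule can be sketched over a tiny taxonomy using the E. coli / Salmonella example from above. The lineages here are abbreviated (real taxonomies carry the full rank path from domain down); the walk from root toward leaf is the whole algorithm.

```python
# Toy lowest-common-ancestor (LCA) over a taxonomy tree. Lineages are
# abbreviated, root-to-leaf, for the two example species in the text.

LINEAGES = {
    "Escherichia coli":    ["Enterobacteriaceae", "Escherichia", "Escherichia coli"],
    "Salmonella enterica": ["Enterobacteriaceae", "Salmonella", "Salmonella enterica"],
}

def lca(hits):
    """Deepest taxon shared by the lineages of all significant hits."""
    lineages = [LINEAGES[h] for h in hits]
    label = None
    for ranks in zip(*lineages):      # walk root -> leaf, rank by rank
        if len(set(ranks)) == 1:      # all hits still agree at this rank
            label = ranks[0]
        else:
            break                     # disagreement: stop one rank above
    return label

print(lca(["Escherichia coli"]))                          # species-level call
print(lca(["Escherichia coli", "Salmonella enterica"]))   # falls back to family
```

With a single unambiguous hit you keep the species label; as soon as the hits straddle two genera, the read gets parked at the family, which is exactly the behavior Bracken later tries to correct for.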
Okay, so this is a bit of a generalization, and there are different flavors of the lowest-common-ancestor approach, but that's the general idea used by many of these tools. In this particular case, since we had a hit here and a hit here, we would call it at the family level: we assign this particular read to Enterobacteriaceae, because that covers both of those, and it says "I don't know which one it is below that." Okay. Yeah: is this the same limit we kept hitting yesterday? Yep. With QIIME, say we were using the naive Bayes classifier: the same fundamentals define how far down the taxonomy you can put a label on something. It's not strictly a lowest-common-ancestor approach, but fundamentally it's the same rationale for why you can't push the label any further down. Okay. And then to try to explain Bracken, which is always fun. After Kraken, the authors state you should use Bracken, and essentially what it does is help refine your taxonomy assignment. Kraken does this really fast searching and the lowest-common-ancestor assignment, but the problem with the lowest-common-ancestor approach is that the more closely related species you have sequenced within a group, the more reads end up assigned higher in the tree, which actually restricts how far down you can assign taxonomy. Okay. So what Bracken tries to do is correct for that and push your abundance estimates back down the tree, so you get higher-resolution taxonomy assignments. What's shown here is that with Kraken, which is in blue, quite a few of our reads for, in this example, Mycobacterium get placed higher up in the tree, and Bracken redistributes those reads down to their proper genomes. And how does it do that?
Well, it uses a Bayesian approach, but essentially the idea is that if you have certain taxa in your sample, you can say: okay, I already know this taxon is really there for sure. We have species A; we know it's here. And now we can take some of those reads that got placed higher up, because we didn't have good enough resolution, and start binning them into this genome, because we think that's where they actually belong. Okay. So it's an estimate to help refine things, so that not all of your reads end up higher in the tree. The take-home message is that you should run Bracken after Kraken if you're using that tool, and it helps improve species-abundance estimates in the sample. Okay, so just backing up a bit. We talked about MetaPhlAn, we talked about Kraken, and there are lots of variants of both of those tools. So which one is best? The classic question that everybody tends to ask. And it's a difficult question to assess. It depends on the database you're using: different tools use different sets of genomes, some bigger, some smaller, some more curated and refined. It depends on how you're testing: people use different mock and simulated communities. It depends on the cutoffs for each of the tools. It depends who you ask: I used to show all these papers where basically every tool shows that it's the best, which is always fun. And then it really depends a bit on the underlying approach. Kraken and MetaPhlAn have the same target (we want to know taxonomic composition) but they're quite distinct views on what you're trying to get. One is saying "this is the taxonomic composition"; the other is saying "how many reads can I annotate?" And so there's some variance in what we even mean by taxonomic composition. Let me go a bit further on that.
So what I mean by that, in real terms, is that if you think about taxonomic composition, you can think about it in two major ways. And there's a nice paper from Rob Knight's group about this. One is counting cells, and one is sort of counting DNA, or genomes, okay? So if you use Kraken, and we're taking an all-reads approach, you can imagine a larger genome, right? We're going to sample more of it. So we're actually measuring sort of the amount of DNA in that taxonomic composition, right? So if a bacterium has a genome that's twice as big, it'll show up as, you know, sort of twice as much in our stacked bar chart. Whereas MetaPhlAn uses a marker-based approach, and so it doesn't have that bias due to genome size. It's measuring a single marker, or multiple markers, but ones that are single copy within a genome. And so at that point, you're sort of counting cells, okay? So one is really counting numbers of genomes or cells, and the other one is counting sort of the amount of genomic DNA. Yeah, yeah, question. Like, say, between Kraken and MetaPhlAn? Only if you want to compare them; I would never say you shouldn't. The reality is you'll get pretty different results. It may be good for your own sanity, but I'm not sure if you'll be able to turn it into something cohesive. Now, what I was going to show, I guess, and I tried to cut out a lot of this, is that we did do this comparison. Robin did this comparison, which was a huge rabbit hole, and it went just nuts. And you can check out the paper to see; again, we were trying to find the best, and of course "it depends" is the big answer. But the one thing we really wanted to show, and I think it was sort of new in the paper, was about MetaPhlAn. Now, MetaPhlAn 4 came out just after we put the paper out, so the paper is on MetaPhlAn 3. But there are some key differences.
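The genome-size effect is easy to see with toy numbers. Everything below is made up purely to illustrate the bias: two taxa at equal cell abundance, one genome twice as large, so an all-reads profiler sees twice the signal unless you divide by genome size.

```python
# Made-up numbers illustrating the genome-size bias: two taxa at equal
# *cell* abundance, but taxon_B's genome is twice as large.
genome_bp = {"taxon_A": 4_000_000, "taxon_B": 8_000_000}
cells     = {"taxon_A": 1_000,     "taxon_B": 1_000}

# Shotgun reads are sampled roughly in proportion to total DNA present:
reads = {t: cells[t] * genome_bp[t] // 10_000 for t in cells}
print(reads)  # taxon_B collects ~2x the reads despite equal cell counts

# Dividing read counts by genome size recovers the cell-level view:
per_cell = {t: reads[t] / genome_bp[t] for t in reads}
print(per_cell)  # equal again
```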
And so for MetaPhlAn 3, it's fast with low computational requirements. So you could probably run MetaPhlAn on your laptop. That's really nice. Very simple bioinformatics setup: you would install it and could get it up and working in a day. It's fairly good for well-characterized environments where we have good reference databases, so really good for human microbiome studies. And because of that, we see pretty good precision. So if it says something is in there, it's probably in there, and there are not too many false positives. Where it suffers some is on recall, so finding all the organisms in the sample, all the taxa, especially for environments that aren't well characterized. So if we're talking about soil, or our dolphin guts or something, something where we don't have a lot of great reference genomes, it'll suffer on that recall. And so I don't think I have the plot in here, but essentially there is one plot where we show, you take a particular sample, like a soil. For that soil, 16S showed around, oh, I don't even remember the numbers, it was like 100 or so ASVs. Okay. And then Kraken spat out, with its default parameters, something ridiculous like a thousand different species. And then MetaPhlAn said there were like 10. Right. So those are non-compatible numbers. This is a basic thing, like how many things are in my sample? And you're getting, at the time, MetaPhlAn showing like 10, 16S sequencing showing maybe a hundred, and Kraken with default parameters spitting out some crazy number. So obviously for soil with MetaPhlAn 3, we were not satisfied with that. We were like, okay, we know there's more than 10 things in here. So it's really falling over at this point. Maybe it's better with MetaPhlAn 4; that'd be great. And then, so, we talked about that in the paper.
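Precision and recall, as used above, are easy to pin down with hypothetical taxon sets (the sets below are invented just to show the trade-off between a conservative and a liberal profiler):

```python
def precision_recall(reported, truth):
    """Precision: of the taxa a profiler reports, what fraction are
    really present. Recall: of the taxa really present, what fraction
    the profiler finds. Both arguments are sets of taxon names."""
    tp = len(reported & truth)
    precision = tp / len(reported) if reported else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

truth        = {"A", "B", "C", "D"}
conservative = {"A", "B"}                 # few calls, all of them correct
liberal      = {"A", "B", "C", "X", "Y"}  # finds more, plus false positives
print(precision_recall(conservative, truth))  # -> (1.0, 0.5)
print(precision_recall(liberal, truth))       # -> (0.6, 0.75)
```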
And then with Kraken 2, we show that it's, you know, probably a bit better for environmental samples that are not well characterized. And the big thing that Robin really explored in that is the use of a confidence cutoff threshold. And I think it's in the tutorial a bit, maybe a description of it. But essentially the default is zero with Kraken, and we basically show that that's probably a bad idea: it leads to a lot of these false positives. And if you increase that threshold some, then you remove a lot of those false positives and a lot of that noise. Now, picking an optimal cutoff is not perfect, of course, and so that's up for grabs. And then the other big thing is, if you're using something like an all-reads approach, say Kraken or Centrifuge or something, the bigger your starting database is, really, the better. Yeah. So more comprehensive, as big as you can go, really helps things. And "as big as you can go" really comes down to the time you have to build a database and the computational resources you have. So for Kraken, they have different versions, but often what they'll do is load the whole database into RAM, right, memory in your machine. And for your personal laptop, you know, you might have 16 or 32 gigs. These databases are large, like 800 gigabytes, a terabyte of space. And so, you know, you're not going to do that on your laptop; you need access to significant computational resources. Okay. Yes, questions here and then back. Yep. Yep. No. So, and actually it's a good point, I don't think I talk about contamination at all, unfortunately. But it would be considered something you would do after this step. Yeah. And so there are a few different approaches out there. We haven't used a ton of them. So the idea is, hey, okay, you sequenced something.
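The confidence threshold idea can be sketched roughly like this. Kraken 2's real score is computed over the taxonomy tree from each read's k-mer assignments and handles ambiguous k-mers, so treat this as the intuition only; the hit counts are made up.

```python
def confidence(kmer_hits, clade_taxa):
    """Rough sketch of a Kraken 2-style confidence score: the fraction
    of a read's classified k-mers that fall within the clade the read
    was assigned to. (The real calculation walks the taxonomy tree;
    these hit counts are invented for illustration.)"""
    total = sum(kmer_hits.values())
    in_clade = sum(n for taxon, n in kmer_hits.items() if taxon in clade_taxa)
    return in_clade / total if total else 0.0

# A read whose k-mers are split between the assigned clade and elsewhere:
hits = {"Escherichia": 20, "Salmonella": 5, "elsewhere": 15}
score = confidence(hits, {"Escherichia"})
print(score)  # -> 0.5: kept at the default cutoff of 0, dropped at 0.6
```

Raising the cutoff discards weakly supported classifications like this one, which is exactly where the false positives tend to live.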
You've run some negative controls and you show that maybe there's this, you know, sequencing or reagent contamination, or some low-level thing that is in your negative control. How do I remove it from my real samples? Right. And so I've seen everything from the basic approaches, like I just identify those taxa and take them out of the sample. You know, pretty crude, but I guess it works, unless it's a major player in your sample. And then there are more elegant tools out there; the one that pops into my head is decontam, which is an R package that lets you take those negative controls and then use a better approach to try to remove those contaminants. I don't have a ton of experience with them personally, but yes, they exist out there. But that would be considered something to do after this step. Yeah. Sort of in the filtering. Yeah. Question over here. Yeah. No, no, I mean, that's, to be honest, that's good, right? That's what we typically see. So we will often run our negative controls. And we get asked a lot why we don't always run negative controls at the IMR. And the reality is we don't have, like, any reads in those bins, or they're super small amounts, right? Like, talking 10, 15, 20 reads, maybe 100 reads max across certain things. So again, unless you're really, you know, focusing on the rare microbiome, for us, I guess, we don't worry about that much; but if we did see something, that would be a problem. But there could be, my mind was going back, right, there could be situations where it's still good, right? So we always do negative controls for extractions, just to view the amount of DNA in those negative controls. Sometimes we'll sequence those, but typically we don't see that much. But again, if we were doing a very, you know, low-biomass thing, like, say, John's placenta microbiome, right, I mean, that would be a good thing to check out. Okay, let's move into function.
And as always, I'm slower than I expect, but that's good. Okay, so we're all done talking about taxonomy this time. We're going to talk about function and about, you know, what that means. And I just want to cover some real basics about what I mean by function, because it can mean a lot of things. So by function, microbial function, we could be talking about really general categories, right? We could be talking about, you know, does my microbe photosynthesize? Does it do nitrogen metabolism? Does it have glycolysis? Or we could be talking about really specific functions, like annotating a single gene as a particular enzyme, right? A particular type of gene. And so there's this hierarchy, almost like the hierarchy in taxonomy, from very precise things at the gene level, annotating it with a particular function (you could even think about protein domains), all the way up to very general pathways. Okay. And what's even more fun than that is that there are lots of different types of functional databases. And so I sort of allude to that a little bit here. If we want to talk about annotation, we can go through a whole history of these things. Like COG, the Clusters of Orthologous Groups of genes, which have been around for a long time but haven't been updated in forever. There's a system called the SEED system; the front end on that is a tool called MG-RAST, which, as far as I know, still exists, where you can upload your data. Pfam focuses on protein domains. UniRef has really become, I would say, the most comprehensive, and often the backbone of a lot of tools out there. And essentially it's just a clustering of genes at particular identity thresholds. So within UniRef, they have UniRef100, which would be 100% identical; UniRef90, saying they're at least 90% identical; UniRef50, 50% identical. Okay.
For a lot of those, we don't have an annotation to say functionally what they're doing, but it's great as a starting point to have a gene catalog. And the other nice thing is there are lots of mappings from UniRef to other classification schemes. So you'll see mappings from UniRef to EC numbers, which are found in the KEGG database. The KEGG database is still pretty popular, even though it's been under a licensing fee for quite a while, and you'll still see people talk about KEGG orthologs and KEGG pathways. If you've seen one of those nice metabolic pathway maps with things highlighted over them, often that's a KEGG pathway. And then MetaCyc and KEGG are probably the biggest competitors, with MetaCyc, I would say, becoming the more popular option now. And so it's primarily thought of as an alternative for KEGG, or a replacement for KEGG, and you'll often see a lot of systems mapping to MetaCyc at some point. I guess I didn't talk about the EC system, the Enzyme Commission numbers; you'll often see EC numbers, and those are found within, you know, KEGG or MetaCyc. EC numbers are these guys, right? So "EC", usually followed by a numbering scheme and then, you know, a definition. There's not enough time to go through all these different databases, but I'm just giving you the highlight that there are lots of different databases out there, and you can often map between them. I didn't talk about the Gene Ontology, you know, GO, which you've probably heard about as well. That's another system out there, and you can sometimes map from particular genes to the Gene Ontology. For annotation systems, now, there are quite a few things out there as well, right? If you have your data and you want to get functions, how would you go ahead and annotate? There are some web-based solutions out there. So you can upload your data to EBI's metagenomics server, called MGnify. I don't know if NCBI has one anymore.
MG-RAST was a popular one back in the day; I'm not sure how well it works now. IMG/M allows you to upload data and get back profiles. I would say, like most online systems, they're not too bad, but you're a bit restricted in what you can do with them, and then you're at the mercy of waiting for them to do the annotation for you. But a nice one-stop shop. MEGAN has been around since the earliest days of microbiome research. I don't know what version they're on now; I don't personally use the tool, but it's another graphical option out there that you may want to explore. I think they're on like version four or five. Has anybody used MEGAN before? Anyway, it exists. It's cool. It's nice. And then there are lots of, what I would call, locally run systems, right? Like things that we're going to run today on a server, where you have complete control over how to operate them. There are a lot of different systems out there. I mentioned Carnelian. HUMAnN 3 would probably be, I would say, the biggest pipeline out there. This is also from Curtis Huttenhower's group. I'll briefly describe it at a fairly high level. But then there are also, I would say, a lot of people still just sort of homebrewing their own approach for functional annotation, because there are so many different databases and so many different things you might want to do, so you'll see people just doing some sort of search against the database they're most interested in and homebrewing it themselves. And so, in Microbiome Helper, we've used HUMAnN quite a bit in our lab, and we've also just put together a pipeline that uses MMseqs2 to do that searching and then does some normalization after. Okay. Okay. So challenges in functional annotation are very similar to the ones with taxonomy. There are partial gene fragments. A lot of times when we think about function, we're thinking about, like, a whole protein, right? But we're operating on raw reads.
We're not even getting a whole gene, right? We're getting 150 base pairs of a gene, right? That's 50 amino acids. It's hard to call something definitively this function when it's only a fifth or a sixth of a whole gene. You get similar genes across different types of organisms. There's still the problem of really large data, both on the database side and on your sequencing side. And then there are different sorts of normalizations we have to take into account, maybe, to help correct what we spit out into our table. And the two big ones I want to talk about here are gene length and inferring pathways and modules. So gene length is just this idea that, hey, if we have a gene that's twice as long, we're going to, again, by chance, sequence it twice as much, right? But usually we want to count those genes as individual units. And so often you'll see in many of these pipelines a normalization to account for gene length. So they know that particular gene's length, and they'll essentially just divide your count by that gene length as a normalization step, so you're not counting really long genes more than short genes. And then the other big problem is inferring modules and pathways across different organisms. So this is the idea that, you know, you can have genes annotated, and we often want to group those into larger pathways to say, hey, is this particular pathway present in my sample? And it's very difficult then to differentiate whether that pathway is all encoded in a single organism versus, you know, spread across multiple organisms. And also whether that pathway is completely covered or not, or whether only part of the genes in it are covered. So the key steps of a functional annotation pipeline are, you know, some sort of searching. So as I mentioned before, BWA and Bowtie are very fast, but they're limited to DNA space. And what we'll see come in here, in terms of function, is that we tend to focus a lot more on proteins, and because of that we'll use tools like DIAMOND and MMseqs2.
So they're actually a bit slower than the mappers, but they're much more sensitive, because it's in protein space, so you can find more distant homologs. And also, what did someone mention there? Oh, yeah. So you could do this with BLAST, but obviously BLAST would just be way too slow, and you'll see approaches like these do it quite quickly. Yes. And then it comes down to the database, depending on whether you want comprehensive or more focused, and large databases are good. But depending on, you know, what you're interested in, this is where you don't really need the biggest database ever if you're particularly focused on a certain area. So we'll talk a little bit about, you know, if you're interested in virulence factors, right, you can focus on a virulence factor database and not focus so much on, say, carbohydrate metabolism. Oh, I see I got ahead of myself with my slides. So there are these different normalizations. I just talked about normalizing genes by their length. You'll often see this reported in reads per kilobase per million (RPKM), or just reads per kilobase (RPK). This kind of comes from RNA-seq land. But again, you'll see it reported as count information, like a number, and that's been normalized by gene length. There are other things that go into determining what to call a function or not, and it's very detailed and I could go into it quite a bit. But, you know, since you're taking into account how similar your sequence is to the database, you can try to normalize based on that. You can try to normalize by the average genome size of a sample. This is kind of an interesting thought experiment: if you have a sample here and a sample here, and genomes on average are much larger in this sample than in that one, then you'll actually, you know, always find more functions in this one.
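The gene-length normalization above boils down to one division; here it is with made-up read counts, just to show why two genes at the same per-copy abundance end up with the same RPK value:

```python
def rpk(read_count, gene_length_bp):
    """Reads per kilobase: divide the raw read count by gene length in
    kilobases, so long genes aren't counted more just for being long."""
    return read_count / (gene_length_bp / 1000)

# Two genes at the same per-copy abundance; the second is twice as long
# and so, by chance, collected twice the reads (invented counts):
print(rpk(600, 1_000))   # -> 600.0
print(rpk(1_200, 2_000)) # -> 600.0, identical after normalization
```

RPKM just adds a second division by total mapped reads (in millions) to make samples of different depths comparable.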
And so there's a tool called MicrobeCensus, which tries to correct for that bias. We don't see it applied that much, but it's kind of interesting. And then often, you know, a lot of these tools like HUMAnN will apply some sort of scaling factor just to give you whole numbers, or larger numbers, instead of very small decimals. So just take into account that the numbers you see may not be, you know, the actual number of reads mapped to a gene; they're normalized through various processes. Okay, and then pathway inference is probably the other big thing that you would see in something like HUMAnN. And this is, again, where we're trying to reduce essentially spurious pathways. So you can imagine, if you have a pathway, my example here is in the second part: if you have, you know, 20 genes in the pathway, and in your actual sample only two of those genes are found, do you call that pathway twice? Do you call it zero times? Like, how do you figure out what the count of that pathway is? This isn't simple like taxonomy, right? And so it could be that the pathway is not covered, right? Two out of 20 doesn't seem very sufficient. Or maybe you didn't sequence enough, and all those genes really are there. So there are different approaches to try to essentially gap-fill and remove spurious pathways. A lot of them rely on a tool called MinPath, which has been around quite a while, and as far as I know HUMAnN still relies on MinPath to help curate and trim off spurious pathways that shouldn't be counted. Again, the take-home message here is that pathways are sometimes a bit of an approximation from the basic genes encoded in those pathways. Okay. Moving right along. So this is HUMAnN. It's a very popular tool, and so, you know, I just thought I'd present the major steps here and sort of how it does its searching, just to give you a flavor for it.
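The two-out-of-twenty example above is really a coverage calculation. MinPath and HUMAnN do something considerably more sophisticated (parsimony-based pathway selection, gap filling), so treat this as the bare intuition only, with hypothetical gene IDs:

```python
def pathway_coverage(pathway_genes, observed_genes):
    """Fraction of a pathway's genes actually observed in a sample: the
    kind of quantity a pathway-inference step checks before deciding
    whether calling the pathway present would be spurious."""
    pathway_genes = set(pathway_genes)
    return len(pathway_genes & set(observed_genes)) / len(pathway_genes)

# 2 of 20 genes observed: probably too sparse to call the pathway.
print(pathway_coverage(range(20), [3, 7]))  # -> 0.1
```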
So the idea here is that it has sort of two major searches going on. So you have your input here. And what they do initially is screen your reads first using their MetaPhlAn tool, to say what taxa are present in your sample. And then, based on those taxa, they can essentially take the genomes representing those taxa in their database and screen your reads only against those genomes. Okay. They do that with Bowtie 2, which we know is really fast. And then after that, they're left with all the reads that didn't map to one of the genomes that they think is in there. And for those, they do the slower translated search with, I believe, DIAMOND, unless they've converted to MMseqs2, and they do that search against all of UniRef. And then they combine the results together. Why would they do this two-stage approach? It's mostly speed. It speeds up their process a lot, because doing that translated search against the whole UniRef database is really slow, and so doing the initial search really helps speed things up. The other thing it does is give you stratified output. And so we'll talk about this a little bit, mostly just because it's dear to my heart, and we have built some tools around it. I'm a big proponent of using this information. So if we think about functions, we can, you know, annotate a particular function. So this is an example coming out of HUMAnN 2. They've identified that this particular UniRef90 cluster is present in your sample. Okay. They have a gene name for it, IMP dehydrogenase, that comes from the UniRef database. And then they give you this count information. So this would be RPK; say it's 600. So that's in one sample. So that's how you can view that function. And you would have, you know, your list of functions in a table by your samples. And you're all set, right? And that's great. That's cool. And then, so, people will often do that for functions.
They'll do all their analyses on that table of functions. And then, you know, they'll also do all their analyses on a table of taxa. But they don't often join the two. They sort of just do them separately, even though we know they're really connected, right? So the nice thing about HUMAnN, which I like, is that you can get the stratified output. So instead of just getting that total functional amount, they'll actually give it to you as a breakdown. So you'll get a breakdown of which taxa contributed that function. And then there's always this sort of unclassified amount, which adds up the part where we don't know what taxon is contributing. Now, this is a beautiful output, because I think I stole this from Curtis's group. The reality is, this unclassified proportion is really high in a lot of their output. So this breakdown looks beautiful, but in reality, when you look at their functions, it's usually like one or two things, and then like 90% of that function would be unclassified. So we didn't really like that in my lab, and that's why we sort of homebrewed our own approach outside of HUMAnN, to try to improve that stratified output. The other thing you get from HUMAnN is pathway information. And so this is just showing a MetaCyc pathway. And they'll give you sort of two different takes on this: they'll either give it to you as coverage, which is just saying whether it's covered or not, a one or a zero, but they'll also give you abundance information. And this comes back to what they call a pathway and how they calculate this number. Okay. But then you can take those pathways and do analyses on those. So coming back to the stratified output: why does this matter? And I'm nearing the end here, so this is good. You know, if we just talk about a particular single function, so this is just function X.
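To make the stratified output concrete, here is a toy parser over HUMAnN-style rows. The row format (a `|` separating the function from the contributing taxon, with `unclassified` for unknown contributors) follows the HUMAnN 2/3 gene-family tables, but the rows and the UniRef ID are made up, and this is not a full HUMAnN table reader:

```python
from collections import defaultdict

def split_stratified(rows):
    """Split HUMAnN-style stratified rows into per-function totals and
    per-taxon contributions. Each row pairs a feature string with an
    abundance; stratified features look like
    "UniRef90_XXXX|g__Escherichia.s__Escherichia_coli"."""
    totals, strata = {}, defaultdict(dict)
    for feature, abundance in rows:
        if "|" in feature:
            func, taxon = feature.split("|", 1)
            strata[func][taxon] = abundance
        else:
            totals[feature] = abundance   # the unstratified total row
    return totals, dict(strata)

rows = [
    ("UniRef90_XXXX", 600.0),
    ("UniRef90_XXXX|g__Escherichia.s__Escherichia_coli", 400.0),
    ("UniRef90_XXXX|unclassified", 200.0),
]
totals, strata = split_stratified(rows)
print(totals["UniRef90_XXXX"])                  # -> 600.0
print(strata["UniRef90_XXXX"]["unclassified"])  # -> 200.0
```

With tables split this way, you can ask per-function questions like what fraction of each function is unclassified, which is exactly the problem described above.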
And if we looked at that across, say, five samples, we might see very little variation in that function, right? And without taking into account which taxa contribute that function, we're sort of left in the dark about what's going on in the community. But if we take that stratified output, we can actually then, you know, visualize and count how different taxa contribute to this function. And this is highlighted a bit in this paper of ours, but it's also highlighted in, I think, the HUMAnN 3 paper. But the idea is that we can get different situations: situations where we have a lot of different taxa contributing fairly evenly to a function, which is okay. We have this situation over here where all of a sudden we see really different taxa contributing to this function, which is pretty cool. And if you contrast just B and C here, right, those are telling two really different stories. B is just like, okay, it's conserved across taxa, kind of boring. C is suggesting a bit more like, hey, maybe this is really important to this community: we see the same amount of function, and it's being contributed by different things. Maybe it means that this function is actually being selected for by the community, because we always see it. This one's a bit more boring, right? It's just a single taxon contributing most of that function. And again, we might have a situation down here where we have, you know, a single replacement. So it's hard to tell, in this case, whether it's just a difference in what species is there versus, you know, some sort of selection pressure. But I've been really going on about this for a while, just that this information can be leveraged quite a bit. So there are different ways to think about that data. And there are not many tools out there that actually handle the stratified output.
You can get these tables, but then, like, how do you visualize them? What do you do with that? And so we have been developing this one tool called JARRVIS. It's not perfect. It's a little buggy. But it runs in RStudio. And the idea, though, is that it gives you a fairly easy way, if it works perfectly, to visualize essentially the connections between your sample groups (this is Crohn's disease versus non-Crohn's), your taxa, which are found in those different sample groups, and then the functions contributed by those different taxa. Great. And your choice of what pathways or functions to visualize comes down to how you want to filter the data. So maybe you've done some differential abundance testing ahead of time, maybe you just took the most dominant, whatever it is. But it's nice in that you can find a particular pathway, say this 7208, and see that it's contributed by different taxonomic groups, and then how that relates back to samples. So anyway, I think it's kind of cool, and I'm hoping it gets cooler. But yeah, it's at the very end of your tutorial. So if you get there, and you get it to work, like, I don't know, I should give you a bonus prize or something. Anyway, I hope to see more tools like this in the future. Okay. And then my last, really last, two slides are just about specialized gene annotation systems. So we've talked a lot about fairly general functions. But this is where metagenomics is blown wide open, right? You don't have to use HUMAnN 3 as a pipeline; there are so many different options and ways to think about metagenomic data, to annotate it and do different things with it, even just to functionally annotate it. And so there are lots of different systems out there. So I talked about virulence factors; there's a virulence factor database, VFDB, right?
So you can find and, you know, compare the number of virulence factors between different samples. AMR is huge, right? Antimicrobial resistance, both in, you know, people, but also in the environment, finding these genes. And so you'll see a lot of people using the CARD database, essentially mapping reads to CARD and trying to come up with, you know, a catalog of the different resistance genes in there. And that can be as simple as doing your own little DIAMOND search, or there are more elegant systems being created that try to use machine learning and things to improve upon that catalog. If we're interested in carbohydrate-metabolizing genes, you'll often see CAZy highlighted, which gives us a really nice catalog and classification system for breaking down different types of carbohydrate metabolism. If you're interested in phage and prophage, we really don't do them much service today in this talk, but obviously there are a lot of annotation systems out there, primarily for sorting what we think are viral sequences versus not, and also trying to assign taxonomy to them. So I've seen VIBRANT, and tools like VirFinder and VirSorter; there's some really cool stuff in that area, you know, trying to link phage or viruses to their hosts. So linking a phage to its bacterial host using, like, CRISPR spacers; there's a lot of development in the bioinformatics world around that. And then there are just other large gene catalogs out there. And you'll see a few of these make a big splash, and sometimes I don't see them used that much in practice, but they are pretty cool. So the one I highlight here, on my last slide, is this Global Microbial Gene Catalog. So they basically built this huge gene catalog: they took as much data as they could and grouped it into different genes. And because a lot of this comes from metagenomic data, they have the original source of where they found each gene.
And so you can use these catalogs to say, hey, I have this really interesting gene, does it map to the catalog? Where else has it been found around the world? And do analysis sort of from that perspective. Okay, so I have talked for a straight hour and 15 minutes. I do want to end with a quick disclaimer; I always like to say this. We're talking about, you know, possible functions. These are at the DNA level; we're not at transcriptomics. John's going to go into a lot more detail about this tomorrow. But remember, we're measuring DNA, right? So we're not talking about transcripts or metabolites. So function here is really this idea of potential function. And also, the DNA, depending on your environment, might not even be from a living cell; DNA lasts quite a while. And so just take that into account when you're thinking about describing functions in a community. Like, is there any chance that a lot of those cells are dead, and this is some sort of leftover historical signal?