All right, well, welcome, everyone. Today I'll be giving you an overview of how to analyze shotgun metagenomic sequencing data in general. I'm going to focus in on a few specific tools, but I'm also going to attempt to give you the lay of the land a little bit, because I know from experience that as soon as you start trying to analyze this kind of data, you start googling "what tools should I use for metagenomics?", and you instantly come across fifty different tools that all claim to be better than the last. It's very confusing, and it's hard to know what to use. So I'm going to try to tell you about the major classes of these tools and the most popular ones out there, and I'll try to emphasize some of the caveats of each.

The main learning objectives for this lecture are to introduce the pros and cons of using shotgun metagenomic data in general, to get you familiar with the major approaches to taxonomic profiling, that is, figuring out who's there in the community, and to understand the difference between functional and taxonomic data types. I'll go over the somewhat subtle difference between those data types, and then I'll focus in a bit more detail on the HUMAnN2 pipeline, along with a few other examples as well.

Okay, so as we learned yesterday, the two main approaches for profiling the microbiome are 16S rRNA gene sequencing and shotgun metagenomics. With 16S, you essentially have a barcode of who's there in your community: you amplify out these variable regions, which tell you which genera or species are present. You typically don't have more resolution than that, although with denoising methods you can get exact biological sequences, so in some cases you can discern strain-level variation. And then, using tools like PICRUSt, you can get the predicted community functional potential. So basically: okay, I know which taxa are here, and based on that information, what are they likely doing, what functions are they probably carrying out? And as a side note, when I say taxa, that's just the general term people use for whatever microbes are there; it could refer to species or to genera.

In contrast, shotgun metagenomics is the case on the right here. You can imagine we have this one sample with lots of different microbes; you can see these differently covered genomes here. Essentially, these genomes are all just chopped up, and all the DNA is sequenced indiscriminately. Importantly, this approach doesn't know where the DNA came from, so there could be human DNA in there, or other contaminants; it's much less specific than 16S sequencing. But you get a lot more information, for instance about which strains are there, because you can actually get strain-level genes and identify them more easily. And you can directly measure the community's functional potential, rather than having to infer it. As we'll talk about, though, the complexity of analyzing this data is a lot higher, as you can imagine, since you're not just looking at one gene, you're looking at essentially all the genes in the community.
So the major pros of 16S sequencing are, number one, that it's well established. Although some people might struggle with QIIME 2 a little at first, there's maybe a bit of a learning curve, once you get it set up it's actually pretty straightforward, it's well documented, and it's been published many times, at least QIIME 1 has, so there's a wealth of literature out there that can help you work with 16S. Shotgun metagenomics is a bit more of the wild west right now; it's really improved in the last couple of years, but the tools are only just now starting to mature. The next major advantage of 16S is that it's a lot cheaper, as you can imagine, since for each sample you only need about 50,000 reads; we'll talk in a second about how you need a lot more for shotgun metagenomics. And it only amplifies what you want: since the 16S gene isn't in humans, for instance, you won't get very much human contamination in your 16S data. That's a major problem if you're working with shotgun metagenomics from biopsy samples, for example, where in some cases something like 99.9% of the DNA can be human, so you really have to be aware of that.

A major con of 16S sequencing is that your choice of primers, and which variable region you look at, can really affect your results, as Will was talking about a little yesterday. For instance, you can identify archaea with the standard V3, V4, or V5 primers, but they don't amplify as well as many of the bacterial clades that are commensals in the human gut, so if you're specifically interested in archaea you might be missing a lot, and you might want to choose a different primer set; there are primer sets that have been tested out for different clades of interest. Then again, you usually don't have enough resolution to identify down to the strain level, since in some cases even different species within a genus will have identical 16S sequences, so you won't be able to resolve anything below that. And any clade that doesn't carry the 16S gene, like viruses, you'll have no information on at all. And of course, it doesn't provide direct functional profiling.

Shotgun metagenomics doesn't have this primer bias. It allows you to identify all the microbes there, so if you're interested in viruses and eukaryotes, you'll get that information along with the bacterial data. And it allows direct functional profiling of your data: since you're sequencing everything, if you have a read that maps to a gene, you can at least get some sort of confidence value on how likely it is that a homologue of that gene, or something similar, is present. So it's a very different approach from 16S. The downside is that it's a lot more expensive: one sample typically needs something like 15 million reads for a good sampling depth, rather than 50,000 for one sample of 16S. So it depends on your own study whether using 16S just to get the taxa is more worthwhile, or whether you really want to invest to get the functional profiling data from shotgun metagenomics; it's something to consider carefully. And like I mentioned, host contamination can be really significant in shotgun metagenomic data.
Also, because you're sequencing everything, if there are some abundant microbes with very large genomes, you might be oversampling those, and if there are some rare genomes out there, you might not get much information on them at all. In contrast, if you're just amplifying the 16S gene, it's a lot easier to get an idea of all the 16S variants out there, whereas in shotgun metagenomics, depending on the complexity of your community, you might need a lot higher depth; that's something to keep in mind as well. And lastly, what's often overlooked is that the computational resources required for shotgun metagenomics are typically more restrictive: you usually need more memory, and in some cases more computing power, otherwise your jobs might take a couple of days to run. The underlying bioinformatic analyses also tend to be more sophisticated.

Something to keep in mind is that whenever we're talking about metagenomics or 16S amplification, we're talking about DNA sequencing, so just because you can identify that a taxon is present doesn't mean that it's active, necessarily, or even that it's alive. Similarly, when you're looking at functional profiling using these data types, it's important to remember that this is functional potential: it's what could potentially be expressed. So it's one level away from the transcriptome, which you'll be learning about tomorrow morning, I think. People have looked into this a little bit, and generally, on average, there's good concordance between the abundance of genes identified based on metagenomic approaches and the relative abundance of mRNA from the same genes based on metatranscriptomics; you can see here a reported Spearman's rho of 0.76. So on average it seems to be pretty highly concordant, but there are major exceptions. For instance, genes involved in methanogenesis tend to be very highly expressed even if they're only present in a few copies, and the same goes for antibiotic resistance genes. In contrast, sporulation genes might be widely dispersed but not expressed in most contexts. So it's important to interpret the data correctly: this is DNA data, and just because a gene is there doesn't mean it's being used.
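Just to make that concordance measure concrete, here's a minimal Python sketch. The rho of 0.76 I quoted came from real paired metagenome and metatranscriptome data; the abundances below are purely made up for illustration.

```python
# Toy illustration: comparing DNA-based gene family abundances against
# mRNA-based abundances for the same gene families in one sample.
from scipy.stats import spearmanr

# Hypothetical relative abundances for five gene families
metagenome_abund = [120.0, 45.0, 8.0, 300.0, 15.0]        # DNA: functional potential
metatranscriptome_abund = [95.0, 60.0, 2.0, 410.0, 90.0]  # mRNA: actual expression

rho, pval = spearmanr(metagenome_abund, metatranscriptome_abund)
print(f"Spearman's rho = {rho:.2f} (p = {pval:.3f})")
```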
So generally, when you're working with next-generation sequencing data, a really important step is preprocessing. For 16S as well, it's important to think about how to filter your reads, but this especially comes up for shotgun metagenomics. We referred to this yesterday: as you go along the read, especially near the end, the quality scores drop off. The main way quality is measured is the Phred quality score, so you'll see these Q values. If you looked in a FASTQ file yesterday, you would have seen a line full of weird characters, exclamation points, capital Z's and so on; each of those encodes the quality of one base in the read, and when you look at box plots of quality, that's what's being plotted. Essentially, the score encodes the probability of that base call being an error, so the higher the value, the less likely it is to be an error. Based on this score, people typically trim reads back from the end; there's a tool called Trimmomatic, which is quite popular, but there are a few others out there as well. This is an important step when you're analyzing your data, and it can be boring, but it's really important to keep track of your options and think about them carefully, because for reproducibility, if someone were to process the data differently, they might get different downstream results.

A similar preprocessing step is to identify contaminant reads. Again, this comes up a lot with human and mouse DNA in medical microbiome studies. Typically the reads are mapped to a host reference genome: we know the human genome, so we can take reads from our samples and see if they align anywhere in it. It's important to keep in mind that usually, when people map reads to a genome, it's for something like SNP calling, where we want to identify mutations in human populations, for instance. Here, you might want to be more lenient in how you map the reads: you might want to change the options so that even if a read maps only okay to the genome, you still throw it out, because you want to be conservative. So you might not want to use BWA the way you've always used it in the past; you might want to run it in a more lenient mode.

I also wanted to mention phiX, a microvirus whose genome was the first ever sequenced, and which is a very common sequencing control. Especially on Illumina sequencers, this viral genome is sequenced again and again, the reads are mapped back to the genome, and how well the reads map gives feedback to the sequencer on the quality of the sequences. These reads should usually be removed by the Illumina software before you get your data, but not always, so some of them can get through. It's important to keep in mind because phiX actually is found in the human gut, so if you're studying the human gut especially, you have to watch out for it.
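Going back to the Phred scores for a second, here's a minimal sketch of the standard Phred+33 encoding I just described; the quality string is made up.

```python
# Minimal sketch of the Phred+33 encoding used in FASTQ files:
# each quality character encodes Q = ord(char) - 33, and the
# probability that the base call is wrong is P(error) = 10**(-Q/10).
qual_line = "!I5?Z"  # made-up quality string for a 5-base read

for char in qual_line:
    q = ord(char) - 33            # Phred quality score
    p_error = 10 ** (-q / 10)     # probability the base call is an error
    print(f"{char!r}: Q={q:2d}, P(error)={p_error:.5f}")
```

So a `!` is Q0, meaning the base is essentially a coin flip, while an `I` is Q40, a 1-in-10,000 error probability.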
So I'm going to talk about taxonomic profiling now: now that we have our data and we've processed it, how do we figure out who's there in the community? Again, the problem is this: each black line here represents a read, a string of nucleotides. How do we go from that data to something like this, where we can say that, based on the abundances of these reads, these taxa have these relative abundances? The two major classes of approaches for doing this with shotgun metagenomics data are marker-based and binning approaches. I'm going to go into a little more detail on both of these and focus on one tool from each class.

There are downsides to both. For binning, essentially you're taking all the reads and trying to group them together before you classify them. Remember that before we start analyzing the data, we usually don't know how many organisms there are, or how the reads are expected to cluster together; binning is an attempt to figure out how the reads group. There are two major subclasses of binning, which I'll get to, but because binning often relies on a similarity search, it can be a lot more computationally intensive than the other approaches I'll talk about. Also, although there's something appealing about just taking your reads and aligning them to a reference genome, it's important to keep in mind that the reference genomes aren't perfect: there's lateral gene transfer, many, many microbes don't have reference genomes at all, and many strains don't have reference genomes either. I'll mention that more in a second.

Marker gene approaches, in contrast, try to use either a single well-conserved gene, like the 16S, or clade-specific marker genes. The idea is essentially the same as with 16S: you have a barcode for a particular lineage, which you can identify using your shotgun metagenomics data. Since you're only looking at individual marker genes, this doesn't tell you about the other genes in those genomes, and if you're only using this approach, you wouldn't be able to do assembly either. It's also highly dependent on your choice of markers, since different lineages can have different biases, and depending on the environment you're looking at, clade-specific marker genes might not be very well documented.

Looking at binning approaches specifically, the two major subclasses are composition-based and sequence-based. Composition-based approaches tend to be a lot faster: essentially, they look at the reads, calculate some usually pretty basic metric, and try to group the reads together based on that metric. For instance, they might look at tetranucleotide composition or GC content, so something about how common each nucleotide pattern is in the reads, which should hopefully correspond to a similar clade, ideally the same species. So those metrics are calculated from the reads themselves. Sequence-based approaches are more what I was referring to before: taking your reads and BLASTing them, or doing some other similarity search, against a large database of genomes, then saying, okay, this read mapped to a few of these genomes, I know which species those correspond to, so I'm going to take a best guess, or a conservative guess, of which species the read came from. Two major ways of making that call are the best-hit approach and the lowest-common-ancestor approach.
With best hit, if your read maps to three different genomes but it mapped really well to one of them, you might say, okay, I'm just going to assume it came from that genome. Or you could say, it mapped to three species in this genus, so I'm going to assign it to that genus; that's the lowest-common-ancestor approach. A basic example is shown here. Imagine the check marks correspond to the read aligning and the X's correspond to it not mapping. Say a read mapped to E. coli and Salmonella enterica, but not to the other species in each of those genera, and not to the outgroup species, Pseudomonas. Then we might say, okay, it's consistent with both of those genera, so I'll go to the lowest common ancestor and assign it to Enterobacteriaceae, the family containing both genera. That's how the lowest-common-ancestor approach works, and here's a tiny code sketch of the idea before we look at the actual tools.
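This is a toy Python version of the LCA rule; the lineages are abbreviated and hypothetical (I'm skipping several ranks for brevity), but the logic is the one I just described.

```python
# Toy sketch of the lowest-common-ancestor (LCA) idea.
def lowest_common_ancestor(lineages):
    """Each lineage is a list from root to tip, e.g.
    ["Bacteria", "Proteobacteria", "Enterobacteriaceae", "Escherichia"].
    Returns the deepest rank shared by all lineages the read hit."""
    lca = None
    for ranks in zip(*lineages):      # walk down, rank by rank
        if len(set(ranks)) == 1:      # all hit lineages still agree
            lca = ranks[0]
        else:
            break                     # lineages diverge; stop here
    return lca

ecoli = ["Bacteria", "Proteobacteria", "Enterobacteriaceae", "Escherichia"]
salmonella = ["Bacteria", "Proteobacteria", "Enterobacteriaceae", "Salmonella"]
print(lowest_common_ancestor([ecoli, salmonella]))  # -> "Enterobacteriaceae"
```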
So there are a few different lowest-common-ancestor tools out there that will take your shotgun metagenomics data, apply this approach by aligning against databases, and output taxonomic profiles. MEGAN was one of the first metagenomic tools published, and it's still going strong; it's very well maintained, updates are coming out all the time, and it has a great GUI. The nice thing about MEGAN is that it does not only taxonomic profiling but functional profiling as well, and it has some really nice visualizations. I don't think we'll be using MEGAN in any of the labs, but if you're working with this type of data, I'd encourage you to download it and at least try out some of the visualizations, because they could be useful. MG-RAST is definitely another one to be aware of. It's again an older approach, and it's web based, so running it can take a while. It's appealing because the web interface should make it easier, but it's known for having much lower accuracy than the other approaches, so I wouldn't encourage you to use MG-RAST. And lastly, Kraken is a newer approach that's usually described as the fastest to date, which I believe is still correct. I think I originally wrote "highly accurate" on this slide; it's highly precise, for sure. By that I mean that of the calls it makes, the vast majority tend to be correct, but compared to the other approaches, the sensitivity can be a little lower. MEGAN actually also tends to have very high precision, and Kraken and MEGAN tend to be quite similar in their overall accuracy. And as I'll talk about in a second, all these approaches tend to be much more sensitive than the marker gene approaches, which I think will become clear as I describe them more, and in the lab as well.

So I'm going to try to contrast the two major approaches, and then Laura is going to go into a lot more detail on assembly and binning approaches, so I'll try not to duplicate efforts here. I think Laura is going to talk about Centrifuge a bit. It's an approach written by some of the same authors as Kraken, and it's a similar algorithm. It's still fast, though only about half as fast as Kraken. But the real benefit is that it uses about four gigabytes of RAM. I didn't actually point this out, but Kraken's real downside is that it requires 124 gigabytes of RAM or something in that range, and for most people, that's just not workable. That's one reason why it really hasn't been widely adopted by the field. Centrifuge doesn't have this problem, and as you'll see today, it still runs in a very reasonable amount of time, at least on the test data set. And it's important to keep in mind, although I won't be emphasizing this tool too much, that it has a much higher sensitivity than the marker-based approaches. Again, for your own data you have to think about which approach suits your questions and which caveats you're willing to accept, because at some point you just have to accept the tools' limitations and go with what's best for your question; none of them are perfect.

So again, I think Laura's going to go over this. Oh, no? Okay, sorry, I think I saw it in someone else's slides. Essentially, one reason why Centrifuge and Kraken are faster is that they're able to compress the genomes together. By that, all I mean is: if you have three different reference genomes, you don't have to duplicate effort when you're mapping the reads. The program, in advance, says: these sequences in these genomes are identical, so why bother mapping a read to the same identical sequence in two different genomes? It keeps track of the fact that this stretch of base pairs is shared between these genomes, compresses the duplicates down, and only does the mapping once. So you end up with a much-compressed reference to map against, and you just keep track of which parts of that compressed reference correspond to which genomes and which parts are unique to individual genomes. That's the major reason why this is a faster approach.

The very basic overview of Centrifuge's classification is that it does a similarity search against this compressed genome database, matching input reads 16 base pairs at a time and then trying to extend the match as far as possible. There are lots of options you can tweak, although it's difficult to know, without doing a lot of testing, how they'll affect your results. By default, if a match extends to at least 22 base pairs, it's counted as a hit and given a score based on how well it matched. Often the same read will map to lots of different genomes in the database, and Centrifuge will then report up to the five most likely taxa that the read corresponds to. So it doesn't use the lowest common ancestor to prune all the way down to one assignment; it prunes down to those five taxa. Keep that in mind.
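To give a feel for the exact-matching idea, and loosely for why shared sequence only needs to be indexed once, here's a toy Python sketch. The genomes and the fixed seed length are made up, and the real tools use compressed full-text indexes, not Python dictionaries, so treat this purely as intuition.

```python
# Toy sketch of exact k-mer matching against an index, the flavor of
# idea behind Kraken/Centrifuge classification (greatly simplified).
from collections import defaultdict

K = 16  # seed length; Centrifuge starts from 16 bp exact matches

genomes = {  # tiny made-up "reference genomes"
    "genomeA": "ACGTACGTACGTACGTTTTT",
    "genomeB": "ACGTACGTACGTACGTGGGG",
}

index = defaultdict(set)
for name, seq in genomes.items():
    for i in range(len(seq) - K + 1):
        index[seq[i:i+K]].add(name)   # a shared k-mer is stored once,
                                      # tagged with every genome it's in

def classify(read):
    """Count exact k-mer matches per genome for one read."""
    hits = defaultdict(int)
    for i in range(len(read) - K + 1):
        for genome in index.get(read[i:i+K], ()):
            hits[genome] += 1
    return dict(hits)

print(classify("ACGTACGTACGTACGTGG"))  # hits both, but more to genomeB
```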
In the lab, you'll briefly be running this tool, and the key thing will be to think about how different its output is from MetaPhlAn2's, which I'll talk about now. So those were the binning approaches to taxonomic profiling. In contrast, there are two major subclasses of marker gene approaches. In the simplest case, these use just a single gene, like the 16S: you can map your reads to a database of 16S sequences, retain just those reads, and essentially use QIIME 2 from there. You could do that for another universal single-copy gene as well, like chaperonin-60, or a host of others that might be more clade-specific. So technically you could do that; it's not what people typically do, but that's sort of the origin of these marker gene approaches.

In recent years, tools have been developed that use several universal genes at once. That would be PhyloSift: it has a database of 37 universal single-copy genes, and the reads are mapped to all of those genes. These genes should be present in the large majority of bacteria, though each might be missing in five percent of genomes or so. So you basically have 37 different barcodes, and based on how well the reads map to each of them, you get a pretty good idea of which clades are present, or at least the relative abundances of those clades.

A different approach is to not just use these universal single-copy genes, but to actually figure out, for individual lineages, say at the genus level or the family level, which genes are found only in that lineage and are found in everything within that lineage. I'll explain that again in a second; MetaPhlAn2 uses this approach, and these are called clade-specific marker genes. The MetaPhlAn2 database, at least the last time I looked, had about a million markers from about 17,000 genomes, mainly bacterial, as you can see here. This can technically allow you to identify down to the strain level, although usually it's more like species or genus, because it's difficult to say with confidence that a specific gene is found only in one strain; the number of marker genes for strains is much smaller than for species, for instance. The real advantage is that because you're just mapping to this relatively small database of clade-specific marker genes, the similarity search is pretty fast, and one strength is that you know most of the reads won't align, so it can be done very quickly. That's also the downside, though, because you might only be using something like 2% of your reads. So if you're just getting taxonomic profiles from your metagenomics data this way, you'd be throwing out a large portion of your data. You're probably getting very good estimates of which taxa are there, based on what's in the database, but you could be missing a lot using this approach.

Just to highlight again what clade-specific marker genes are: a core gene is defined as something that's found in all the genomes within a lineage. So for clade Y here, you can see that all the tips of this tree, which correspond to different genomes, have this core gene in red. But by that definition alone, a core gene can still be found elsewhere in the tree. This tool specifically looks for core genes that are unique to a clade, and it's that kind of gene, a clade-specific marker gene, that's used to identify the relative abundance of individual lineages.
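Here's a toy Python sketch of that "core and unique" idea, with made-up genomes and gene sets; the real pipeline works from full genome annotations, but the set logic is the same.

```python
# Toy sketch of finding clade-specific core genes, the idea behind
# MetaPhlAn2's markers (hypothetical gene presence/absence sets).
clade_genomes = {        # genomes inside the clade of interest
    "strain1": {"geneA", "geneB", "geneC"},
    "strain2": {"geneA", "geneB", "geneD"},
}
outside_genomes = {      # genomes outside the clade
    "other1": {"geneB", "geneE"},
    "other2": {"geneE", "geneF"},
}

# Core genes: present in every genome of the clade.
core = set.intersection(*clade_genomes.values())
# Clade-specific markers: core genes never seen outside the clade.
outside = set.union(*outside_genomes.values())
markers = core - outside

print(core)     # {'geneA', 'geneB'}
print(markers)  # {'geneA'}  (geneB is core but not unique to the clade)
```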
The general big-picture pipeline for MetaPhlAn2 is that the first section, the ChocoPhlAn section, is done offline in advance by the authors of MetaPhlAn2. Essentially, they take the FASTA files, just the raw sequence of each genome, and run them through their pipeline. They have annotations of all the genes, and they want to figure out which of those genes are found in all the genomes within a certain lineage, and then which of those core genes are actually unique marker genes. The problem then is to figure out which sequence is the most representative, because you could have something like 100 different versions of the same homologous gene, so they have to choose something like the centroid, or some other approach, to pick the most representative one; they describe that more in their paper if you're interested. Essentially, what comes out is one representative sequence for each clade-specific marker gene, which you can then map your reads against using their MetaPhlAn pipeline.

So they do a similarity search of your reads against this database of marker genes. It's important to keep in mind that a single lineage might have a couple of different, or sometimes quite a few different, marker genes, so it's not trivial to figure out, based on how many reads match all those different markers, what the abundance of that clade should be. They handle that normalization internally, and the output is a relative abundance for each taxon whose clade-specific marker genes were positively hit. So again, it's important to keep in mind that MetaPhlAn2 will only output relative abundance data. I think Rob will be talking more about the compositionality of microbiome data, and about how you can use the read counts directly rather than converting to relative abundances; you lose that count information with MetaPhlAn2.

[In response to an audience question about the number of markers per clade:] Yep, it's actually as many as they can find. I'm not sure of the distribution, but sometimes it's one and sometimes there are quite a few; as long as they're confident it's a core, unique gene, they'll use it.

So why would you use this approach, given the caveats I mentioned? Well, a lot of people use it because it's really fast, and honestly, it's very easy. It also provides a basic way to get information about all the different kingdoms in a very straightforward way, and in general, especially for human microbiome data, it's quite well accepted by the field: if you're just getting taxonomy, you can get a quite robust estimate. Again, the major thing to think about is that you'll be throwing out a large proportion of your data with this approach; you'll see a pretty stark difference when we look at the Centrifuge output in the lab.

Like I mentioned before, the MetaPhlAn2 pipeline does its similarity search with Bowtie2 under the hood, so it's a nucleotide-to-nucleotide similarity search against the marker gene database. MetaPhlAn2, and also HUMAnN2, which I'll be talking about in a second, can both take paired-end data, so forward and reverse reads. But it's important to keep in mind that while it looks like you can input your forward and reverse reads as a pair, and you'd think that if a forward read maps somewhere, that information would help place the reverse read and give you higher-quality mappings, it doesn't actually do that. It actually just concatenates them together.
So it literally concatenates the two files together, essentially, and the forward and reverse reads are mapped independently. Overall, this doesn't really affect the results, given that you're dealing with millions of reads, but it's something to be aware of: that's how it treats paired-end data. Also keep in mind that with this pipeline, each sample is processed individually, and you have to combine them at the last step. In the lab, I introduce a tool called GNU Parallel, which essentially lets you run the same command on a bunch of different files in, hopefully, a straightforward way. Using approaches like that, or some other way of looping over all your files, is required if you want to use MetaPhlAn2; we'll go over that in the lab.

Another, maybe overlooked, plus of MetaPhlAn2 is that it's produced by the Huttenhower lab, which develops a lot of different tools, so it's actually pretty easy to plug the MetaPhlAn2 output into a lot of their other tools as well. I'll be highlighting StrainPhlAn, which is a great tool if you're interested in profiling strain variation. A lot of people are starting to turn toward shotgun metagenomics data to actually track strains, either over time or between different individuals. And why would we care about this? Well, it's been well established, for instance in this somewhat classic paper, I guess it's only from 2008 but it's classic to me, that based on 17 E. coli genomes there's a core genome of about 2,000 genes, and around 13,000 accessory genes. And that's just from 17 E. coli genomes, when there are certainly well over 100 E. coli genomes sequenced by now. So just by knowing that E. coli is present in your sample, you might be missing a lot of the really interesting functions that are actually biologically important in your data; just knowing the species, and hence the core genes, isn't telling you much. Actually knowing the specific strain, and maybe tracking it between individuals, can definitely let you tell how the communities are changing over time. So, for instance, is it something about the E. coli species in general that allows it to establish in an environment, or is it a strain-specific adaptation, perhaps? Related to that is the idea that if you can actually show that a particular strain of an organism is associated with a phenotype of interest, that gives you more reason to believe there's a causal role, or at least a particular gene you could narrow in on for that particular strain. So there's a lot of motivation for getting down to the strain level.

So essentially, although I mentioned that MetaPhlAn2 can output strain abundances in particular cases, that's usually hard, because there aren't many strain-specific marker genes. Instead, what StrainPhlAn does is, for your clade of interest, usually a species, it takes all the marker genes, takes the reads mapped to each of those marker genes, and builds consensus sequences for them. It essentially figures out where the mutations are in these marker genes, trying to distinguish the strain variation based on the mutations in the mapped reads, and then, based on these consensus sequences, different strains are identified in your samples.
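Here's a toy Python sketch of that consensus-building step, a simple majority vote per position over aligned reads. The reads are made up, and the real tool does a great deal more (alignment, quality filtering, SNV handling), so this is just the core intuition.

```python
# Toy sketch of building a consensus sequence from reads mapped to a
# marker gene, loosely the idea behind StrainPhlAn's consensus step.
from collections import Counter

# Hypothetical reads already aligned to the same marker gene coordinates
aligned_reads = [
    "ACGTAC",
    "ACGAAC",
    "ACGTAC",
]

consensus = "".join(
    Counter(column).most_common(1)[0][0]   # majority base per position
    for column in zip(*aligned_reads)
)
print(consensus)  # "ACGTAC": the fourth position votes T, A, T -> T
```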
And that can then allow you, as we'll see, to pinpoint the phylogenetic relatedness of the different strains. That can be useful if you're trying to see whether different clusters of a particular species are associated with a response of interest. We'll be asking whether non-IBD and Crohn's disease patients carry different strains of a Bacteroides species; that's one application of this approach that we'll explore.

Like I hinted at before, this next point is a general one about microbiome data, but it's especially true for MetaPhlAn2 output: when we're talking about sequencing data, it's wrong to think about it in terms of absolute abundance. If we were talking about actual cell counts or biomass, that would be absolute abundance; with reads, this is compositional data. If you literally did more sequencing, you'd get more reads, but that doesn't mean there are more microbes in the sample. This might seem like a trivial difference, but it does matter when you try to explain your results in scientific papers: it would be very misleading to say there's "more" of one microbe than another between samples. You really have to keep in mind that you're dealing with relative abundance data. Imagine a case where one sample had a trillion cells and another sample had a million cells, and you found that, in terms of relative abundance, there was five times more E. coli in the sample with only a million cells. It would be wrong to say there was more E. coli in that community, because, little did you know, it had so many fewer cells in total. So again, it's important to keep that in mind, especially in how you word descriptions of this data in scientific reports. And like I said, this is especially true of MetaPhlAn2, since it will only output the relative abundance, essentially the percentage of each taxon per sample.
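Here are those toy numbers as a small Python sketch, all hypothetical, just to make the relative-versus-absolute trap concrete.

```python
# Toy numbers for the compositionality point: relative abundance can be
# higher while the absolute number of cells is far lower.
sample1_total_cells = 1_000_000_000_000  # a trillion cells
sample2_total_cells = 1_000_000          # a million cells

ecoli_rel_1 = 0.01   # 1% relative abundance in sample 1
ecoli_rel_2 = 0.05   # 5% in sample 2: "five times more" in relative terms

print(ecoli_rel_1 * sample1_total_cells)  # 10,000,000,000 absolute cells
print(ecoli_rel_2 * sample2_total_cells)  # 50,000 absolute cells: far fewer!
```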
Okay, so I'm going to move on now to functional composition. This is related to the PICRUSt output we were looking at yesterday, and I'll be comparing it with the taxonomic profiles I was just describing. Essentially, "function" is a very broad term in the microbiome field. When we talk about functions, we might mean very general categories like entire pathways, something like photosynthesis, or something lower down, like specific gene families, which tend to be shared across multiple organisms. Yesterday we were looking at enzyme classification (EC) numbers, so specific reactions in pathways, but there are different ways of defining gene families as well, such as KEGG orthologs, and I'll describe a couple of others in a second.

Here are some really well-known functional databases. COG is an older one, but it was widely used, especially a few years ago; it stands for Clusters of Orthologous Groups. Pfam is a database based mainly on protein domains, built from hidden Markov models of the amino acid sequences of different protein domains. And KEGG stands for Kyoto Encyclopedia of Genes and Genomes; this was probably the most popular functional database, and it might still be. At the start, it was based on enzyme classification numbers, but it has since branched out to orthologs in general. It's very well annotated, and it provides pretty easy ways to get modules and pathways, so higher-level functions. But the real catch is that it now requires a license fee, so a lot of researchers are using a really old KEGG database, from something like 2013. Because a lot of academics don't want to pay the expensive license fee, people are starting to move to MetaCyc instead. It's a database similar to KEGG, but more microbe-focused overall; those were the pathways produced by PICRUSt2 in yesterday's tutorial.

UniRef has been around for a while, but it's a different approach from the others I've described. You can take totally different perspectives on how to define gene families. You could say: the sequence similarity between these genes might not be great, but they have the same role in this pathway, they carry out the same reaction, so I'm going to call them the same gene family. UniRef, in contrast, literally takes all the known proteins and clusters them all together, at 100, 90, or 50% identity, and says: the sequences that cluster together, that's the gene family. As you'll see, there's a pretty stark difference: if you define function based on this database, you can reach a very different conclusion than if you use something like KEGG. UniRef is definitely the most comprehensive, so you definitely get the most information, but the downside is that the vast majority of these gene families, really protein families, since they're based on protein sequences, are uncharacterized: we don't know what they do. So you essentially have millions of proteins of unknown function. It could still be informative that one of them is highly associated with your phenotype or response of interest, but based on bioinformatic analyses alone, you won't know what it's doing.

So this is definitely a major observation that's been made that you should be aware of. When people have looked at taxonomic and functional profiles from this sort of data, this is basically the normal result. This is Human Microbiome Project data, from tongue dorsum and stool samples, shown as a stacked bar chart. Essentially, all the different colors on the left-hand side refer to different phyla, and the x-axis corresponds to different individual samples; the big yellow bar here means that this phylum is at high abundance within this last sample. You can see that taxonomy tends to be highly variable across samples and body sites. These are then the same samples, across both body sites, but profiled by, I believe, KEGG pathways in this example, and you can see that it looks extremely highly conserved. This has prompted people to say: well, maybe we shouldn't care about taxa. Why would it matter what label we attach to the relative abundances? What matters is what's actually being done in the community. And at some level that has to be true: taxonomy is just names, and really, what we care about are the functions that are supposed to be linked to those names.
And if we're defining function correctly, that really should be what we're more interested in. The problem is that even when you're annotating functions, you might be leaving 50 to 75 percent of your reads unassigned. A lot of the functions you can annotate are probably very highly conserved, so it may not be that surprising if they look stable, and you might be missing a lot of the functions that are more individual-specific. Some have also argued that it's simply incorrect to compare variance across these two data types, just because they're so different; the amount of meaningful variation here might be very different than for taxonomy. That's more controversial. And, as I'll point out right now, how stable function looks depends a lot on how you define what a function is.

Just to hammer this home: here we're looking at variability measured by Bray-Curtis dissimilarity between pairwise samples. You can basically think of these box plots as saying how similar samples are, in this case human stool samples, under each of these different data types, whether we're talking about different levels of taxonomy or different functional levels. This is essentially the same thing I was just showing you before: if you look at the strain level, or even the phylum level, taxonomy tends to be more variable across samples than, oh, I gave it away, KEGG orthologs or KEGG pathways. KEGG pathways especially, because they're high level, but even KEGG orthologs, of which there can be thousands, tend to be more similar across samples than taxonomy. But if we defined our functions based on the UniRef database, we would see much more dissimilarity across the samples; and again, those are based on these protein clusters. So it's important to just be aware of which database you're using; depending on that choice, you will get a very different output.
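Since Bray-Curtis dissimilarity is the metric behind those box plots, here's a minimal Python sketch of it on two toy abundance profiles (same feature order in both lists; numbers made up).

```python
# Minimal sketch of Bray-Curtis dissimilarity between two samples.
def bray_curtis(x, y):
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den   # 0 = identical profiles, 1 = nothing shared

sample1 = [10, 0, 5, 20]   # e.g. four KEGG pathway abundances
sample2 = [12, 3, 0, 25]
print(f"{bray_curtis(sample1, sample2):.3f}")  # -> 0.200
```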
Just to highlight the difference between these databases, here are some pretty stark examples; there are lots of databases in between, but these are sort of the extremes. On the left here is the mean number of species that possess each gene family, protein family, or higher-level function in these different databases: for each function, how many species have it on average? Essentially, most UniRef100 protein sequences are found in only one species, which might not be surprising. But even if you go down to UniRef50, which clusters protein sequences at 50 percent identity, you really only have about three species on average possessing a given protein family. I was surprised when I first came across that; I would have initially thought these would be more widely dispersed. KEGG orthologs, in contrast, tend to be widely shared functions, which in hindsight makes sense, because that's how they were defined: for the most part, these are functions that have been seen again and again, often because researchers were interested in them in a few different model organisms, and that's largely the framework in which they've been defined. This also shows up not only in the mean number of species that have these functions, but in how many entries there are in each database: you can see there are almost 70 million entries, at least at the time of this publication, in UniRef100, whereas we're talking about roughly 400 pathways in the KEGG database. So these are both "functions," but they're very, very different interpretations of the idea.

There are a few different approaches for annotating metagenomic data to get functional profiles. There are some web-based approaches: MG-RAST, which I mentioned is an older online tool, EBI Metagenomics, and the Integrated Microbial Genomes (IMG) database. MEGAN, again, has some great visualizations and a graphical user interface, which could be useful for processing this data. And there are a few locally run approaches as well; of these, I'm going to focus on HUMAnN2. You may have heard of HUMAnN in the past; in the last couple of years, people have been moving to the second version of this tool, and the real plus of this approach is that it provides links between function and taxa.

Now, with all these fancy tools, someone might ask: why do I have to bother? Why can't I just take all my reads, do a similarity search against all the known proteins or all the known DNA sequences, and figure it out myself? It can't be that hard. Well, the first problem is that if you just used BLAST, it would be really slow. You'll come across a tool called DIAMOND, which was published a few years ago and is a couple of orders of magnitude faster than BLAST for a very slight cost in accuracy; that's your first fix. But also, if you just BLAST against all the known proteins out there, it's not trivial to decide, given several very similar hits for the same sequence, which gene you'd say that read maps to. You also have to account for genes of different lengths: longer genes are more likely to be hit by reads when you're mapping, so you have to correct for that (see the sketch after this paragraph). Also, if you want higher-level pathways or modules, the sequencing depth might leave you missing a lot of the important genes, which would be important reactions in that pathway, so you have to think about how to fill in the gaps to get the best estimate of which pathways are present in your community. And that's related to the last point: you'd have to use some existing tool to actually infer the pathway or module abundances from the gene families you inferred. So essentially, these are some of the major reasons why you'd want to use HUMAnN2.
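On the gene-length point, here's a toy sketch of reads-per-kilobase (RPK) normalization, one common way to correct for the fact that longer genes collect more reads; if I remember right, RPK is the unit HUMAnN2 reports gene family abundances in. The counts and lengths below are made up.

```python
# Toy sketch of reads-per-kilobase (RPK) normalization for gene length.
genes = {
    # gene_id: (mapped_read_count, gene_length_in_bp)
    "geneA": (200, 500),
    "geneB": (200, 2000),  # same read count, but a 4x longer gene
}

for gene_id, (reads, length_bp) in genes.items():
    rpk = reads / (length_bp / 1000)   # reads per kilobase of gene
    print(f"{gene_id}: {rpk:.1f} RPK")
# geneA: 400.0 RPK, geneB: 100.0 RPK -> geneA is actually more abundant
```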
HUMAnN2 has been described as a tiered read-mapping approach. What I mean by that is, you start out with your input sequences here, the input reads. In this figure, the spoiler has been given away: we know, though we normally wouldn't, that light blue corresponds to one species and this other color to another species; some of the reads map ambiguously to multiple species, and some of them might be totally novel, from species that have never been seen before and aren't in any database. For those, given the existing methods, we're not going to know, at least based on read mapping, which species they correspond to.

The first step of HUMAnN2 is a taxonomic pre-screen with MetaPhlAn2. It uses those clade-specific marker genes and maps the reads against them to basically figure out which genomes to expect in the sample. Basically, if a clade-specific marker gene comes up positive, it says: I have some genomes from that clade, so I'll map all the reads to the genomes in that clade, because I already have a pretty good idea of which genes should be in those genomes; why start from scratch? That's the second tier, the pangenome tier: after figuring out which clades are there, it takes the genomes of those clades and does a similarity search against just those genomes. Typically this is a smaller database, and depending on how well characterized your environment is, you might not have too many reads mapping here; with human data, you typically see something like 50 percent of your reads mapping at this step. The final step is the more brute-force approach: okay, I'm just going to take all the known protein sequences and do a translated search of all my remaining reads against this whole UniRef database, so a comparison against millions of protein families. Because a lot of the reads were screened out at the earlier steps, fewer reads have to be input here, so this can run a lot faster. Anyway, those are the three major steps; let me know if you have any questions about that.

One of the major outputs of HUMAnN2, because you've done this taxonomic pre-screen and mapped to the pangenomes for at least a subset of your data, is these taxonomic contributions. With the methods Laura will be talking about, genome assembly, you get a much more direct idea of which genes are in which genomes. The idea here, instead, is: I mapped the reads to these pangenomes, and I know which genes are in which genomes, so if a read maps here, you can interpret that as this species contributing that gene to the community. So rather than just looking at the relative abundance of individual genes, you could say: this genus is contributing about 20 copies of this gene, and a different genus is contributing another 20 copies. You get a very different story about how the functions are being contributed across your samples, for instance if you compare the last three samples to the first three here. Typically, if you're using HUMAnN2, the vast majority of the gene families you identify are unclassified, meaning they were only identified at that last translated-search step, so they have no taxonomic link; but where the contributions exist, they can give you some novel insights.

The overall pipeline for HUMAnN2 is shown here, which highlights some of the things I was just talking about. The quality-controlled reads are input to MetaPhlAn2, to hit those clade-specific marker genes, and you get your list of which organisms are there.
Then, once you've figured out which organisms are there, you map the reads only to the genomes of those organisms; that's what the ChocoPhlAn database of genomes is for. That gives you the gene hits for specific organisms. You set those aside for later, and then you proceed with all the reads that didn't map: on those, you do the translated search, against a protein database now, and that's DIAMOND. They say UniRef50 here, which is the protein sequences clustered at 50 percent identity. A lot of people use 90 percent, which is what I tend to use as well. I think it's a somewhat arbitrary choice; they originally suggested UniRef50 and have recently switched to UniRef90, and I think the jury is still a little bit out on which database is better to use. Either way, as I showed before, the vast majority of these proteins are going to be species-specific, or at least clade-specific, say genus-specific, so it's important to keep that in mind when using this approach. Then you get your hits to the protein families, and you can use MetaCyc to get pathway abundances and, as we'll talk about in the lab, pathway coverages as well, meaning how well covered the individual pathways are. So those would be the three main outputs, and they're stratified by the contributing organism where that's possible.

The output files will look like this; we can talk about these in the lab, since you'll be making these output files in just a few minutes. So again, you'll have the abundance of individual gene families: a UniRef90 gene family with this ID will have its community-wide abundance shown here, and where you see the little pipe character, that means the abundance is broken down by which species is contributing that gene family. There's a similar output for pathway abundance, too, these are MetaCyc pathways, and then you also get the coverage of how well those pathways are covered.
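To make that pipe-stratified layout concrete, here's a toy Python parser; the table layout is sketched from memory of HUMAnN2's gene family output (check against your own files in the lab), and the gene family ID and numbers are entirely made up.

```python
# Toy parser for a stratified table, where "|" splits the feature ID
# from the contributing taxon (hypothetical IDs and abundances).
table = """UniRef90_EXAMPLE1\t120.5
UniRef90_EXAMPLE1|g__Escherichia.s__Escherichia_coli\t80.0
UniRef90_EXAMPLE1|unclassified\t40.5"""

for line in table.splitlines():
    feature, abundance = line.split("\t")
    if "|" in feature:
        gene_family, taxon = feature.split("|", 1)
        print(f"  {taxon} contributes {abundance}")
    else:
        print(f"{feature}: community total {abundance}")
```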
This next figure was originally made as a motivating example for why you'd want to use contributional data. These are somewhat cherry-picked examples of cool cases where, if you didn't have the contributional data of which species were contributing the functions, you might be totally misinterpreting the data. By having this information about which species contribute the genes, or in this case the pathways, you can distinguish a case where a pathway has a very similar relative abundance across all the samples, which are on the x-axis here, 76 Human Microbiome Project stool samples, and is mainly contributed by one organism, from a case where the pathway is still very highly conserved across samples but is contributed by very different organisms. Without the contributional data, you couldn't tell those apart, and they lead to different interpretations of how that pathway is being contributed in these different environments. And there are other examples, too, where a pathway is extremely variable across samples, with inconsistent relative abundance, and so forth. By using this contributional data, you can distinguish these scenarios and get more insight into what's contributing a function in your community. That's the real benefit of this approach. But again, I want to emphasize that this isn't the only way to analyze this kind of data, and the next module will talk about different approaches as well.

So yeah, that's it from me. If you have any questions, I'd be happy to help you out. I guess we have half an hour before the fire drill, so we'll get started on the lab.

[Audience question:] Talking about Centrifuge versus Kraken, are there benefits to one or the other beyond the computing requirements, if you have the ability to run either?

That's a good question. In the Centrifuge paper, they do compare precision and sensitivity against Kraken, and the accuracy is very similar: Centrifuge has somewhat higher sensitivity and slightly lower precision, but they're very close. I think that even if you're able to run Kraken, it might still be better to run Centrifuge, as long as you're okay doing the post-processing of Centrifuge's output. Centrifuge's approach is basically to give you all the information and let you figure out what to make of it: again, for one read, it'll report matches to up to five different genomes, and you really have to post-process that, whereas Kraken gives one assignment per read. How you deal with that output data could really change your results. So I would use Centrifuge, but you'd have to do some testing to figure out how to post-process it.

[Audience question:] And can these approaches be used with that kind of data?

I haven't tried, and I'm not too familiar with that kind of data, so I'm not really sure, to be honest. I think they should work, but typically they're used with quite short reads. So yeah, I think that's probably more applicable to the next module.