going to be discussing metagenomics. One thing I want to mention before I get started is that metagenomics is sort of the Wild West compared to 16S analysis. There isn't a clear set of "these are the things you should use" — for 16S, QIIME and mothur are pretty much accepted, but on the metagenomics front everyone is still doing quite different things. What I've tried to do is pull out the main bioinformatic tools that we use in my lab that are, I think, well received in the community, so that you don't get picked apart by reviewers, and that give you a good foundation for doing further bioinformatics on your metagenomic microbiome datasets. So that's the level I'm going at. I'm also approaching it from a "let's just get the basics down" perspective, so definitely feel free to ask questions that are more in depth at any point — that's why you're here, to help guide the discussion — and I'll try to answer them as best I can. Okay, so for the first lecture this morning I'm going to contrast 16S versus metagenomic sequencing, the pros and cons of both. You should be able to describe the different general approaches for determining the taxonomic composition of metagenomic data, you should understand how MetaPhlAn2 works and how to run it in the lab, and I'm also going to talk about STAMP and visualization and statistics techniques at a fairly high level, which you'll also get to use on the dataset. Okay, so first of all, you heard all about 16S yesterday — it was covered really well by Will. So you know that 16S is targeted sequencing of a single gene, or a piece of a gene, that acts as a marker for the identification of different organisms. The 16S ribosomal RNA gene is conserved and universal across bacteria and archaea, so it's often used. The pros are that it's really well established and the sequencing costs are relatively cheap — if you're doing about 50,000 sequences per sample, or even double that, you're in the $20 to $50 range per sample. Another pro is that it only amplifies what you're targeting, so if you're working with a sample containing a lot of other DNA — say a human host or some other host you don't want to amplify — there's no host contamination, because you're only targeting the bacterial or archaeal component. The cons of 16S: first, primer choice. You have to choose primers, and each choice is somewhat biased — it's not completely bias-free — so you can get preferential amplification of certain organisms over others. Another con is that you usually can't get down to the strain level, or sometimes even the species level, with 16S. The gene is well conserved, so you might hit cases where sequences are 100% identical but there are still different strains within that space; identification below genus is a little sketchy, and definitely not possible at the strain level from just an amplicon. Another con is that you usually need different primers for archaea and eukaryotes — for eukaryotes you'd be targeting the 18S gene, which is the same gene, obviously, but more distantly related. And you don't pick up other microbes like viruses at all.
With metagenomics, instead of sequencing a single gene, we're sequencing all the DNA in a sample. And by all DNA, it really means all the DNA — you might only be interested in part of what's going on in there, but the idea is you extract total DNA and go directly into sequencing. The pros are that there's no primer bias — there's no primer choice at all — so you get a better representation, without bias towards particular organisms. You can identify all microbes: viruses and microbial eukaryotes as well as bacteria and archaea. The other big thing is that it provides functional information — not only who is there, but also what they're doing, because you're sampling the genes within those genomes. The cons are that it's more expensive: typically you need millions of sequences per sample, which drives the cost up to something in the $100 to $300 range per sample. Host contamination can also be significant. If you're doing, say, a stool sample, most of the DNA is going to be bacterial and archaeal and you won't have too much human host DNA. But if you did a biopsy sample, or even a skin sample, you're going to get a significant number of human cells, that DNA gets sequenced too, and you end up filtering it out. We had a project looking at biopsy samples from Crohn's patients where about 97% of the reads were mapping to the human host — so, great, do all that sequencing and keep 3% of what you're really looking at. The same goes for any other situation with host contamination, say plant–microbe associations, and even with soil samples you can sometimes pick up things you didn't think you were targeting. Sorry, question down in front first. The question is: we don't know what a lot of human genes do — do we know what microbial genes do? We know what some of them are doing, but there's definitely a bigger unknown category. I can't put a percentage on it, but I would guess about half — John, do you have a suggestion on that? — about half the genes we probably can't annotate with anything functionally useful, though that's a really loose, hand-wavy number. Even if you don't know the function per se, you can at least identify genes that are differentially represented and then maybe later go on to find out what that gene does. So say you have control versus disease, and you keep seeing a particular gene only in the disease group over and over again — that seems worthy of a follow-up to figure out what the gene does, and then you go through a lot of procedures for that. There was another question about all this unknown stuff — we're going to cover that later today; we touch on it after lunch. Okay, so another con is that you may not be able to sequence rare microbes. Because you're sequencing all the DNA, even though you're getting millions of sequences, unless you're doing crazy amounts of sequencing you're still not going to sample very deeply into the microbes that aren't dominant.
Whereas with 16S you're only targeting a single gene, so — I don't know exactly where the crossover is, I haven't done that calculation — but if you're really interested in rarer microbes, you get more depth on that single gene, whereas with metagenomics the sequencing is spread across everything. So you may not actually get the taxonomic depth that you would from 16S. Another con is that the bioinformatics tends to be a bit more complicated with metagenomics. 16S is pretty well described with QIIME and mothur, whereas with metagenomics, as I mentioned at the start, it's a bit of a Wild West and things aren't as straightforward. That being said, there are more fun things you can do with metagenomics, and I'll talk about some of those today. The other thing I briefly want to mention is the semantics of 16S versus metagenomics. People often use "metagenomics" to describe both shotgun metagenomics and 16S — that happens in the literature, and we even named this workshop "analysis of metagenomic data" while covering both 16S and metagenomics. Technically, I get a little cranky as a reviewer — maybe other people don't — but metagenomics really means shotgun metagenomics by default. It's sort of been getting pushed in the other direction to be this all-encompassing term, but there are definitely people who push back on that. So if you're doing shotgun metagenomics, say metagenomics; if you're doing 16S, it's not great to call that metagenomics in a paper. I know people do it sometimes, but it's one of those things some people get upset about. What I do is say we'll do a microbiome analysis, and then somewhere in the abstract I mention 16S so people know what we're talking about. You don't necessarily want to put 16S in the title, but if you talk about the microbiome and make it clear you're doing a 16S or amplicon analysis, I think that's a better approach than stretching the word metagenomics. I know metagenomics is a flashy word, but microbiome is pretty good too, or microbiota if you want. So when I'm talking about metagenomics, I'm talking about shotgun metagenomics — that's my two cents on that one. All right, any other questions? Yeah — the comment was that the reference databases are a bit smaller for metagenomic data, the idea being that we have a lot more 16S sequences covering different organisms, whereas we have fewer reference genomes. That's true, that's a fair point actually. Okay, so what I'm going to talk about in this lecture is mostly: if we have metagenomic data, the first question is who is there? These are the same sort of taxonomic profiles that you would get from your 16S data, but getting them is different with metagenomic data than with 16S. So the goal is to identify the relative abundance of different microbes in a sample given shotgun metagenomic data. There are problems with this. One problem is that the reads are all mixed together — it's a big bag of reads and you have no idea which read goes with which organism.
The reads can be pretty short — 100 base pairs is still pretty common, or 125 or 150; maybe it's getting a bit better. And the other big problem is lateral gene transfer. If a read comes from a gene that has been laterally — horizontally — transferred, you might assign it to the wrong species or the wrong taxonomic group. Question: is that a problem with the methods, or just with the data? More the latter, I think. When you look for an LGT gene in a whole genome, it often sticks out, because it looks more similar to genes in the genome it came from, or its sequence composition looks different, or there are other little signatures. But if you just have the read, there's no background, so you can't know for sure — unless you assemble first, and then you can maybe try to predict that a gene has been horizontally transferred. Question: can I make an analogy to transposons — basically a gene in one bacterium can move into another bacterial species? Yeah, absolutely. LGT happens in microbial genomes through transduction, transformation and — thank you — conjugation. It depends on how often that's happening, but you can imagine why people don't build phylogenetic trees from genomes using every single gene: they know some of those genes have been transferred, and that messes up the tree. Ford Doolittle is big on this, and basically argues you can't even build a tree of life, because over time almost every gene has been horizontally transferred at some point, which makes a tree of life almost impossible — that's a whole other discussion I'm not going to get into. So usually when people build trees, they use conserved genes that they think are unlikely to have been laterally transferred, or haven't been shown to be transferred much, like housekeeping genes. Anyway, it's just something to keep in mind. Question: are there algorithms for detecting this — in genomes or in metagenomes? Not really for metagenomes. I'm working on a project where we're trying to develop a tool to do this: the idea is that you start by assembling reads into longer contigs and then use that information to predict LGT events within those assemblies. It's called WAAFLE, it's from Curtis Huttenhower's group, and we're collaborating with them on it. I think for metagenomes that's about the only one, but for genomes there have been a lot more tools, and sometimes, depending on what your data looks like, you can use genome tools on metagenomes. My PhD was on identifying genomic islands — large regions of horizontally transferred DNA in genomes — so that was a lot of my PhD work before I moved into microbiome work. If people want to talk about LGT later, let's do it. Okay, so I like to group the major approaches for taxonomic assignment into two categories: binning-based and marker-based. For binning-based, the idea is that you're taking all of your reads and trying to group them back into the genomes they essentially originated from.
So you have a big collection of reads and you're putting them back into their little bags — their genomes. There are two major approaches within binning. The first is composition-based, where you use the sequence composition: something as crude as GC percentage, where you're literally counting Gs and Cs and each organism has a slightly different GC content, or something like a k-mer approach, where you count short nucleotide words of a given length. If you used 5-mers, for example, you'd look at all subsequences of five nucleotides and compare their frequencies between organisms, and you can feed those features into different types of classifiers to bin the reads. With those approaches you're not actually doing a sequence alignment, which is the second approach. Composition-based methods are nice because they're relatively fast, but sometimes they get things wrong, because sequence composition doesn't tell the whole story for an organism. Sequence-based binning is probably the most natural way you'd think of doing this: you compare the reads to a large reference database using a similarity search like BLAST or some other, faster method. You just take your reads, BLAST them against some database, get hits, and say, okay, that read is this thing, that read is that thing. With that approach, reads can be assigned on a best-hit basis, where you just take the top hit and call the read whatever that hit is. The problem, of course, is that there's a big difference between a best hit that's 100% identical and one that's 60% identical — you wouldn't want to call something that's only 60% identical to the reference the actual species. So one approach is to use the lowest common ancestor. Has everyone done a BLAST search before? I assume so — no, yes, sort of, okay. BLAST is a similarity search: you get back a whole bunch of hits from your read against the database, you apply some cutoff, and then you take all of the hits above that cutoff and ask what the lowest common ancestor in the tree is that covers all of those hits. That becomes your taxonomic assignment. Does that make sense? Still not quite clear? Sure. The question was: you have, say, five hits that meet your criteria — how do you decide which of the five the read belongs to? The answer is: you don't. You make your assignment less specific. There's usually more information layered on top of this, but — can I erase this board, since the notes are on the back wall too? Okay. Say you have a read and you get back a BLAST report: hit one, two, three, and so on, and we apply some cutoff. As a good example, say the hits are anywhere between 80 and 90% identity. You can't really be sure that the top hit is the right call — it could be — but you don't want to assign the read right down to that exact strain. The general approach is that you have a tree representing these genomes.
Say hits one and two are actually the same species, and this is the species tree, and the hits map to these different parts of the tree. Then, depending on your cutoff, if you think all of these hits are significant, you wouldn't make the assignment at the strain level — you'd make it at, say, this node. If all of the hits share the same genus, then you assign that read only down to the genus level. You can get into more elegant versions of this, where you start weighting the hits by percent identity and playing around with that, but the general idea is what I've just described. Does that make more sense now? Right. So, notable examples of tools that use the lowest common ancestor: MEGAN was one of the first metagenomic tools around, and it also does functional profiling. MG-RAST is an online server where you submit your data and wait for weeks or months; the results come back and you can usually choose different types of assignments — best hit or lowest common ancestor — although I haven't used it in a few years, so I don't know if those are the exact options, but you'll see cutoffs as well. And Kraken is one of the newer methods — a really fast binning-based method that speeds up the similarity search tremendously for metagenomic reads. The downside to Kraken is that it has very large computing requirements, very large meaning about 128 GB of RAM, so you need a relatively beefy server. But once you have that, Kraken is really quick at assigning reads against a database. The nice thing about Kraken as well is that you can build your own custom database very easily. So if you wanted to use, say, all of the NCBI genomes plus some of your own, you'd create the database, and instead of doing a slow BLAST, Kraken indexes both the database and your reads and does a very fast similarity search, with taxonomic assignment built in. All right, does that make sense? Okay, so that was the binning approach, where you're trying to use all of the reads in your data. The other approach is to use only a subset of your data to figure out the taxonomic composition. Most of the time people do want to get all of the reads assigned to their appropriate spot, but if you're just interested in the relative abundance of organisms, you don't actually have to assign every single read — you just have to figure out the composition. So one option is marker-based. With marker-based methods you could use a single gene — you could extract, say, the 16S sequences from your metagenomic data and then proceed through an amplicon-style pipeline as before. People sometimes do that to check whether their primers are biased, because you're getting the 16S from the metagenomic data without any amplification. You could also use other universal genes — cpn60, or anything else that's conserved that you think would be a good representative — and then proceed through existing amplicon pipelines like QIIME. Beyond a single gene, you can use multiple genes. There's a tool called PhyloSift that uses 37 universal single-copy genes, and it uses those conserved genes to figure out the taxonomic composition.
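Going back to the lowest-common-ancestor idea for a second before we move on: here's a rough sketch in Python of how a read might get assigned from a set of filtered BLAST-style hits. This is just an illustration of the general logic — the lineages, identity cutoff and helper function are made up for the example, and real tools like MEGAN or Kraken implement this far more efficiently.

```python
# Toy lowest-common-ancestor (LCA) assignment from BLAST-style hits.
# Each hit is (taxonomic lineage as a list, percent identity).

def lca_assign(hits, min_identity=80.0):
    """Assign a read to the deepest taxon shared by all hits above the cutoff."""
    # Keep only hits that pass the identity cutoff.
    lineages = [lineage for lineage, pident in hits if pident >= min_identity]
    if not lineages:
        return "unclassified"
    # Walk down the lineages level by level (domain -> phylum -> ... -> species)
    # and stop at the first level where the remaining hits disagree.
    lca = []
    for ranks in zip(*lineages):
        if len(set(ranks)) == 1:
            lca.append(ranks[0])
        else:
            break
    return ";".join(lca) if lca else "unclassified"

# Hypothetical hits for one read: two E. coli strains, one Salmonella,
# and one low-identity hit that gets filtered out.
hits = [
    (["Bacteria", "Proteobacteria", "Enterobacteriaceae", "Escherichia", "E. coli K-12"], 89.0),
    (["Bacteria", "Proteobacteria", "Enterobacteriaceae", "Escherichia", "E. coli O157:H7"], 88.0),
    (["Bacteria", "Proteobacteria", "Enterobacteriaceae", "Salmonella", "S. enterica"], 85.0),
    (["Bacteria", "Firmicutes", "Lachnospiraceae", "Blautia", "B. producta"], 60.0),
]

print(lca_assign(hits))  # -> Bacteria;Proteobacteria;Enterobacteriaceae
```

So instead of picking one of the strains, the read only gets called down to the family level that all the good hits agree on — exactly the trade-off described above.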
Another approach, which I'm going to talk about in more detail, is MetaPhlAn2, which uses clade-specific markers. Instead of looking at universal genes, it looks for markers that are unique to a clade — to a particular part of the tree. I'll show graphically what that looks like in a moment. Okay, so marker or binning — what should you do? With binning approaches, the similarity search is pretty computationally intensive; you can use things like Kraken, or maybe DIAMOND, which I'll talk about this afternoon, to speed that up. Another downside to binning approaches — I had "varying genome sizes" on the slide and for a second I couldn't remember what I meant, but here it is. Genomes come in different sizes, so you can imagine that if one genome is twice the size of another, a binning approach that uses all of the reads will give you roughly twice as many reads assigned to the bigger genome. So when you look at your taxonomic composition, that organism might appear twice as abundant just because its genome is bigger — and that doesn't reflect the actual number of cells at all. In other words, if you assign taxonomy to all the reads, bigger genomes will simply be overrepresented in your relative abundances, and it's something most people never really look at. Now, the downside of marker approaches is that you're only using a subset of the data, so you can't link functions directly to organisms — you're basing your taxonomic assignments on only a few genes, and for all the other genes, say you annotate them with functions later, you don't have a taxonomic assignment. That's something people often want, and it's probably the major downside of marker approaches. Also, with marker approaches, genome reconstruction by assembly isn't really possible: you're only using certain markers and you're not doing the binning. With a binning approach you can sometimes filter reads down to a putative genome and then do genome assembly afterwards on those binned reads, to hopefully reconstruct genomes or partial genomes, or at least longer contigs. And then, obviously, the results depend on your marker choices — you have to choose which markers to go after, and if those are biased in any way, that can affect your predictions. Question: is the genome size issue something people actually discuss? I think that's fine — people do talk about genome sizes and what they might mean. People will get into discussions about how bigger genomes tend to indicate generalists, which tend to do better in, say, soil, because they have a larger repertoire of genes, while smaller genomes are discussed as being more specialized to a particular environment. So people do discuss that.
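Just to put some toy numbers on the genome-size issue: below is a made-up sketch of correcting read counts by genome size before computing relative abundance. This isn't any particular tool's method — the organisms, counts and genome sizes are invented — but tools that come up in a minute, like MicrobeCensus and MUSiCC, handle this kind of correction much more carefully.

```python
# Toy example: correcting read counts for genome size before computing
# relative abundance. All numbers are invented for illustration.

read_counts = {"Organism_A": 10_000, "Organism_B": 10_000}          # reads assigned by binning
genome_size = {"Organism_A": 2_000_000, "Organism_B": 4_000_000}    # bp; B's genome is twice as big

# Naive relative abundance: A and B look equally abundant (50/50),
# even though B only got that many reads because its genome is larger.
total = sum(read_counts.values())
naive = {org: n / total for org, n in read_counts.items()}

# Genome-size-corrected "coverage": reads per base pair of genome,
# which is roughly proportional to the number of cells.
coverage = {org: read_counts[org] / genome_size[org] for org in read_counts}
corrected = {org: c / sum(coverage.values()) for org, c in coverage.items()}

print(naive)      # {'Organism_A': 0.5, 'Organism_B': 0.5}
print(corrected)  # {'Organism_A': ~0.67, 'Organism_B': ~0.33}
```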
But I guess what I'm saying here is that if the goal is to make a pie chart or a stacked bar chart or some other representation of "these are the things in my sample and these are their relative abundances", we wouldn't want bigger genomes to be overrepresented just because of their size. Question: but aren't you still just looking at GC percentage or k-mers — doesn't that give you everything anyway? What I'm saying is that whether you use k-mers on the reads, or even if you assemble first, you're counting something to get your relative abundances — reads or contigs assigned to each bin — and when you count those up you may get more hits for an organism simply because its genome was bigger in the first place. I actually don't see people discuss this much in the literature — have you, John? I thought about it when I was making the slides; it's one of those pros and cons that I don't think most people think about. Comment from the audience: we actually ran into this recently. We were comparing two soil communities, one contaminated by agricultural activity and one not, and asking whether there's a difference in the number of genes associated with particular functions between them. One thing we noticed was that there's actually a really big genome size difference between the two communities, and if you don't account for that going into the analysis, it skews how many reads get assigned to the genes of interest, and what the relative abundance of those specialized genes — phosphorus-related genes, for example — looks like compared to just average genes. So it can be important to look at. If you want more information on this, I can point you to two papers: one tool called MUSiCC, and another called MicrobeCensus. What does MUSiCC do? I've seen it come up a few times. MUSiCC tries to do a comprehensive normalization of functional abundances. It takes genome size into account, but it also accounts for things like the fact that a pathway with more genes in it is more likely to be detected than a pathway with fewer genes, and it takes gene length into account. What it tries to do is go from "number of reads assigned to a pathway" to "number of cells carrying this function", so you go in with functional abundances and hopefully come out with something closer to a per-cell measure of function. Yeah, and we'll discuss the functional side more later, because HUMAnN does some of this too, so some of the same issues come up. I hadn't seen as much on the taxonomic side about normalizing for this, so that's good to know. Okay, so why am I focusing on MetaPhlAn as the tool I'm going to demonstrate? There are a couple of reasons. One, it's relatively fast.
It's nice for people to be able to get their profiles quickly — if you have your metagenomic data, it's relatively easy to get a taxonomic profile on a regular desktop. It has markers for bacteria, archaea, eukaryotes and viruses, which is quite nice: you at least get that representation, whereas other approaches are sometimes more tailored towards, say, bacteria and archaea but not viruses, and if a third of your sample were viruses you just wouldn't know until you did a separate search. So at least MetaPhlAn will hopefully give you a high-level view of that. That said, the markers for viruses aren't perfect, because viruses are very diverse, so I don't think that side is perfect, but at least it makes an attempt at giving you the proportions of all those things together. It's also being updated and supported, which is nice — it hasn't just been abandoned for a few years, which happens to a lot of bioinformatic tools, leaving you wondering what's wrong with them. It was used by the Human Microbiome Project, and it's generally accepted as a robust method for taxonomic assignment, so if you're worried about reviewers, that's one of those things you can take into account. The main disadvantage, as I mentioned, is that not all of the reads get assigned a taxonomic label. You really have to keep that in mind: it gives you a profile of what's in your sample, but you can't then dig in and say, okay, now give me all the reads that were assigned to this particular genus, because I want to do functional annotation on them. You can't do that. If that's what you want to do, you'll have to use another approach, like a binning approach based on k-mers or sequence composition. Does that make sense? Okay, great. Question: isn't it potentially a good thing, though, that ambiguous reads are left unannotated rather than being misleadingly forced into something? Right — the read is ambiguous and you don't know what it belongs to, and other approaches might force it into an assignment that isn't necessarily correct, so leaving those reads unannotated isn't a bad place to be. Yeah, I completely agree with that. If you're really just after a taxonomic profile, I actually like the marker-based approaches a bit better for exactly that reason. You end up using the other taxonomic approaches when you have to do more with the raw reads. With binning you have a lot of reads you don't know where to put, so maybe the algorithm assigns them to some higher level, and then you're left with a whole bunch of stuff at some family level that isn't very descriptive. And of course there's the whole LGT problem as well — if you're trying to assign taxonomy to antibiotic resistance genes or other things that get transferred all the time, that can be problematic. So for getting a nice profile of the data, I think the marker-based approach is cleaner and you have fewer false positives. Yep — the question was about comparing or combining the two approaches. I haven't seen anything formal.
I think people do do both. The problem is that the results will look somewhat different — if you made a stacked bar chart from a binning approach versus a marker-based approach, it's probably going to look slightly different. But people will run both because they want to do different things with the data, so I think that's a fair statement. A hybrid approach that really integrates the two and says "this is the answer" — I haven't seen anything like that. Have you seen anything like that, John? No, I haven't seen an all-in-one package that tries to take the best of both worlds. That would be useful. I'll write a grant. I don't know — I think it's a good idea. I hate saying things have never been done, because there's a lot of stuff out there, but you can imagine that if you did the marker-based approach first, you'd get these nice, pretty confident bins — organisms that you really think are there — and then you could take the rest of the data and try to push the reads into those bins accordingly. I think that would be the best hybrid approach; I don't know why it hasn't been done. Cool, I'm getting lots of ideas, this is awesome. Yeah, in the back — a question about viruses. I think so — virus people would try to scrape every viral sequence they could out of a reference genome database and then use some similarity search against that. MetaPhlAn uses markers to give you proportions, and it gives you some viral labels, which is pretty cool, but if you're really going after viruses, you might use MetaPhlAn first and then you'd really want a big viral database — we're starting to do some of this for a project — and run, say, Kraken or some other similarity search against it to annotate. Of course the problem with viruses is that there are always prophages sitting within bacterial genomes, so sometimes it gets a little tricky knowing whether you're looking at a free virus or just a phage integrated into a bacterial genome — but those are the challenges. Question: what does MetaPhlAn do with the substantial chunk of reads it doesn't use — does it just forget about them? For the marker-based approaches, yeah, absolutely. The goal here is to give you the best representation of what's in your sample, and it's okay to throw away a lot of the data to do that. Imagine throwing away all the data except the 16S reads and using that for your profiling — MetaPhlAn just uses more information than that, but no, it's not using all of it. And there's debate about which is better: do you try to use all the reads and accept some false positives, or do you use only the best reads and base your assignment on those? Okay. So, MetaPhlAn uses these clade-specific gene markers — I'll show you on the next slide. A clade represents a set of genomes and can be as broad as a phylum or as specific as a species in the phylogenetic tree. MetaPhlAn2 has about a million markers derived from around 17,000 genomes in its latest database. Most of those are bacterial and archaeal, some are viral, and some are eukaryotic. It can identify down to the species level, and they even claim that sometimes it can get down to the strain level. And, as I mentioned before, it's relatively fast.
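Conceptually, the abundance estimation from clade-specific markers looks something like the toy sketch below. This is only to illustrate the idea — the marker names and numbers are invented, and MetaPhlAn's actual algorithm does considerably more (marker weighting, quality filtering, handling of missing markers and so on).

```python
# Toy illustration of clade-specific marker profiling (not MetaPhlAn's real code).
# Reads are mapped against a marker database; each marker belongs to exactly one
# clade. Coverage of each clade is estimated from its own markers, then the
# coverages are converted to relative abundances.

# Hypothetical data: clade -> {marker_id: (reads_mapped, marker_length_bp)}
marker_hits = {
    "s__Bacteroides_fragilis":    {"m1": (120, 1200), "m2": (90, 900), "m3": (110, 1000)},
    "s__Escherichia_coli":        {"m4": (30, 1500),  "m5": (20, 1000)},
    "g__Prevotella_unclassified": {"m6": (50, 800)},
}

def clade_coverage(markers):
    # Average reads-per-base across the clade's markers: a rough proxy for
    # how deeply that clade's genome is covered in the sample.
    per_marker = [reads / length for reads, length in markers.values()]
    return sum(per_marker) / len(per_marker)

coverage = {clade: clade_coverage(m) for clade, m in marker_hits.items()}
total = sum(coverage.values())
relative_abundance = {clade: 100 * cov / total for clade, cov in coverage.items()}

for clade, ab in sorted(relative_abundance.items(), key=lambda kv: -kv[1]):
    print(f"{clade}\t{ab:.1f}%")
```

Because only reads hitting the markers are counted, the unassigned reads simply don't enter the calculation — which is exactly the trade-off we were just discussing.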
It can handle millions of reads on a regular desktop computer, which is pretty sweet. All right, so this slide gives a representation of what MetaPhlAn is trying to do. The approach is that you're looking for markers, shown here in red. You can think of them as genes — or even parts of genes — that are both conserved within and unique to a particular clade of the tree. A gene found all around the tree wouldn't be a good marker, whereas something that has only ever been seen within one particular clade is a good candidate for representing that clade, and because it's conserved within that clade, the assignment is made at that level. That's just one marker, though — they don't depend on a single marker, they use a combination of markers, and then, waving my hands a bit, the algorithm figures out the best assignment based on multiple markers per taxon. I don't think I need to get into the details: they take the reference genomes, identify the markers, build the marker database, and then your reads get mapped against those markers; the counts are normalized and you get, hopefully, your pie chart or stacked bar chart of the data. All right, does that make sense? Okay, now some nitty-gritty: MetaPhlAn uses Bowtie2 for the sequence similarity search. Bowtie2 is a DNA read mapper — traditionally mappers are used more for mapping reads back to a reference genome, often a human genome, though it can be any genome. Mappers are really good when the percent identity is quite high; I don't know the exact numbers, but below roughly 85% identity they start to struggle, whereas for things that are 95 to 97% identical they're great — really fast and pretty accurate. They're limited to nucleotide-versus-nucleotide comparisons, so there's no protein-space mapping. There are other mappers out there — BWA exists, and some people would argue BWA is better — but for whatever reason MetaPhlAn went with Bowtie2 as its mapper. A couple of caveats. You can give MetaPhlAn paired-end data. With metagenomic data your read pairs might not be overlapping — I may come back to paired-end stuff because it always comes up and I don't have a slide on it, so ask me later — but the point is, MetaPhlAn takes paired-end data as input but doesn't actually treat it as paired: it basically treats the two reads as completely independent, so they get counted twice and mapped separately, which is good or bad depending on how you look at it. Each sample is processed individually, the results basically get combined into a single table, and the output is relative abundances at different taxonomic levels. Actually, I do want to mention one thing about paired-end data right now. With amplicon data you have a really good idea of your amplicon fragment size. For the stuff we do, it's usually a 450 base pair fragment, so if you do paired-end 300 plus 300, you know you're going to get an overlap.
So if all the fragments in your library are exactly the same size — it doesn't even have to be the same amplicon, but if you had a really nice library construction and they're all identical — you can design things to optimize your overlap. Say you're doing Illumina 150 base pair reads: in a perfect world you could design the fragments to be, say, 250 base pairs, and you'd get a nice overlap between your read pairs and could stitch the two together. That typically doesn't happen with metagenomic data, because it's really hard to get an exact insert size during library construction — the DNA is fragmented either by sonication or, more commonly now, by a transposase-based approach that cuts the DNA somewhat randomly into different sizes. So you end up with some fragments that are short — maybe 100 base pairs or even less — and others that are long. When you do your paired-end Illumina sequencing, sometimes the reads will overlap and sometimes there will be a gap between them. A question I often get is: should I try to stitch the pairs together with a tool like PEAR? One approach is to stitch them, see how many you recover, and then use both the stitched and the non-stitched reads. Another approach is not to do any stitching and just use the paired-end reads independently, or use a tool that can map the data as proper pairs. My point is that you'll usually get a library with a range of insert sizes, the pairs aren't always going to overlap, and stitching will only give you a fraction of them back as merged reads. So traditionally we just don't bother with stitching on our metagenomes now — we either use the two reads independently, or, if the tool handles paired-end data, we use it that way. Does that make sense? Okay. All right, so I have a quick blurb here — another quick caveat; I should have called this slide "Morgan's annoying reviewer comments" or something. One little pet peeve is absolute versus relative abundance. I think most people know this, but I like to hammer it home. The important thing to remember is that absolute abundance means numbers that represent the actual quantity of what's being measured — the actual amount of a particular gene or organism. When we're doing sequencing, we're not actually counting cells in any way. Relative abundance just represents the proportions within a sample: you have no idea whether sample A had more total material than sample B, only that the relative proportions of different organisms change. Whenever we do microbiome studies we're almost always working in relative abundance space — we're talking about changes in the proportions of functions or organisms, not about absolute quantities. The reason is that we're usually amplifying the DNA at some step, so we're not keeping track of the actual raw amount of DNA — we lose that information and it's not quantitative. A really simple example: say you have two samples, and sample A has 10^8 bacterial cells.
We don't know that from the sequencing. And say 25% of that microbiome is classified as Shigella — great. Now compare that to sample B, which actually has about 100-fold fewer bacterial cells. You don't know that either; you don't know that the bacterial load is lower. But in that one, 50% of the microbiome is classified as Shigella. So you're writing up your paper and you say, that's really cool, sample B contains twice as much Shigella as sample A. Wrong. That would be wrong because, in reality, you actually had more Shigella in sample A. So don't talk about quantities. You always have to remember that we're talking about proportions, so you would say you found a greater proportion, or a greater relative abundance, of Shigella compared to sample A. Again, it's a little nitpicky, and most people understand it, but you don't want to interpret the data wrong — especially with something like Shigella, where people start talking about pathogens and assuming that because there was a greater proportion, that sample is somehow more dangerous, when you have no idea about the bacterial load. People often gloss over this with samples like stool, where the bacterial load is thought to be close enough between samples that it maybe doesn't make a big difference, but for other sites the bacterial load can be completely different. If I swab this table and compare it to my nose, the bacterial loads are really different, and it doesn't make sense to talk about absolute quantities there at all. I know it's probably over the top, but whatever. Question: but it's still okay to say which taxa are present? Yeah, I think it's okay to say that — based on your prediction of the taxonomic assignment, yes. That's what we're trying to get at: reporting the proportions within your sample. Okay, so how are we doing for time? We'll touch a little more on this. There was also a point about combining datasets — things like different variable regions do make that hard. If you're trying to combine datasets, especially at the read level — say you did a human study with 10 disease and 10 control samples and you think, oh, I'll just use the HMP data, the Human Microbiome Project, to increase my control group — in a perfect world that would be great, but in reality the HMP data is going to separate out completely on its own, because there are so many differences: how the stool was collected, whether it was frozen right away, the DNA extraction method (which will give you different results), how the library was constructed, the sequencer and the type of sequencing, and then the bioinformatics. Even if you did all the lab steps exactly the same, you could do the bioinformatics really differently — and even if you took the same FASTQ files, you would still often see differences between studies. So most people are pretty skeptical of combining datasets across studies for now. Yeah — a question on that.
Audience question: for example, we try to standardize as much as possible, but things happen — some samples take a long time to get to the lab, some parents do things with the samples that you didn't want them to do, collection conditions vary. We're really struggling to know how to deal with that: what's more important, that, say, the Toronto site used a different collection method, or that a sample took three days to get to the lab? Do you know what I mean? Yeah — I'm not sure I can address your exact question, but on the stool collection side, from what I've looked at, you do see differences. There was a nice study looking at fresh versus frozen, and another about how long samples sat at room temperature, and whether they were frozen at minus 20 or minus 80. From the bioinformatics side, the best you can do is record those things so you can flag them later. If you at least write that information down, you can check later on your PCoA plot, or whatever, whether all the samples that were collected at home and frozen later group together in the plot. From the data I've seen, though, the DNA seems relatively stable — you might get a little bit of separation on ordination plots, but the actual taxonomic abundances don't change that much. Every clinician I've talked to has the same sort of problem — some people literally have participants put samples in an ice cream bucket in the freezer at home and bring them in at the next visit. So it's a problem, I don't have the answer, and everyone is dealing with the same thing; no one has it ironed out perfectly. What I can say is that freezing seems to be the better choice, rather than relying on some other method and worrying about degradation — freeze it if you can. Follow-up comment: the ice cream bucket in the freezer is going to be reflected in those samples; we can't change that, but we need to know what the impact is in reality. I think the expectation is that it's not going to have a dramatic impact on the sequencing, though it may be a little different — but that's all you've got, so that's what you can do. Yeah, and you're always hoping the biological signal is bigger than the noise. If there's not much difference between your controls and your disease group, then anything that adds noise is going to hurt you. But if you have a big difference, you're still going to see those big differences even with the noise created by the storage or extraction technique. It often comes down to that.
There have been a couple of meta-analysis studies, and sometimes they'll claim, yes, we can still combine the data — but it's usually at the level of "we can tell body sites apart", which is a massive difference, not the subtleties of picking out different organisms or functions within a disease state. So it does depend on how big the differences are in the first place; you're just adding noise at all these different levels. Audience comment: keep in mind that in most of the studies we run, the storage conditions are consistent within the study, so maybe you don't care as much. There are also a lot of blog posts on this — if you search for the method plus "storage" or something similar, you'll find posts. And you could run the comparison yourself on healthy controls — it wouldn't take much; you could do it on yourself pretty easily. Another comment: we did PCR studies on the degradation of fungal genes in soil, to see how fast fungal DNA degraded, and after a month we were still getting DNA. So as long as you're in a reasonable range, you're probably fine. Right — I don't think the problem is that the DNA won't be there; the problem is that things may degrade slightly differently across species, and that's the concern, especially if some DNA sticks around longer than others. Comment: I read a paper from Knight's group where they did a field study on how long samples could sit out, comparing against frozen stool, and the conclusion was that it was accurate enough — essentially the same, not even kept on ice. We're always told to put samples on ice right away, so that's a bit different from what we usually hear. But it's a little disappointing that these studies have been coming out since 2008, 2009, and even now the HMP website doesn't really list best methods for storing and collecting samples. Yeah, it would have been nice if the HMP had done that — with all the other things they did, it would have been nice if they had hammered out those variations. There was also a study on soil samples — I think people typically homogenize a section of the core, so the surface gets mixed in with the core. All right, that was a good morning discussion. Okay: visualization and statistics. So there are lots of tools out there to do statistics and lots of tools to visualize data; I list the common ones here. Excel — yeah, people poo-poo it, but it's still probably the most widely used tool around, so some people use Excel. SigmaPlot is nice plotting software. R is a really big one — there are tons of packages in R; there's a bit of a learning curve, which is getting better with the tools that wrap around it, but R is really popular and super powerful. I thought I updated this slide — oh, there are two slides in here; that's my old slide.
Here's the new one — ignore the other one. So: Excel, SigmaPlot. PAST is a nice open-source package for doing stats — one of my students found it recently and it's really nice; if you're looking for something that's kind of like Excel but actually useful for statistics, PAST seems to be it. R has its main libraries; Python has nice plotting libraries as well, and if you combine those with things like NumPy you can do stats and visualization, but that requires coding skills. The tool I'm going to talk about today is called STAMP. STAMP was published, I think, originally in 2009 or 2010, and then there was another paper in 2014. My name is not on the paper — actually, my name isn't on any of the papers today, so it's completely unbiased, as opposed to yesterday's "PICRUSt is awesome". Today I'm basically not on any of these papers; these are just the tools we tend to use, which do change over time, of course, but that's my recommendation. Although I do have a slight bias here: Rob Beiko was my postdoc supervisor, and I still talk to him quite a bit, but he lets me do whatever I want, of course. Okay, so STAMP is a really nice graphical program — you install it on Windows, Mac or even Ubuntu — and I really like it because it melds enough stats and visualization for biologists who want to explore their data. It's one of the tools I always point people to for getting started with their data. I'll give you a quick walkthrough of the layout. It looks like a lot at first, but after a few minutes — and you'll get to play with it today — you'll start to pick it up. There are quite a few drop-downs, but once you learn it, it becomes pretty powerful for looking at your data. The nice thing is that it's completely agnostic about the input: you can load in taxonomic profiles or functional profiles, or really any table of data, and it will work. Okay, the main features: there's a visualization area in the middle. Over here, you give it your profile file — say your OTU table — and you also give it a map file. The map file's first column is usually your sample IDs, and then you have different metadata about those samples; you'll see one in the tutorial today. The map file can contain lots of different groupings, and you can quickly switch between them here on the right-hand side. (By the way, enterotypes are awful — don't ever talk about enterotypes.) These are different groupings of the samples, so you can quickly change how the samples are colored, which changes the coloring in, say, this PCA plot. The other big thing is over here at the profile level — this is at the genus level, but you can quickly change which taxonomic level you're looking at, so you could go to species or family depending on what the table contains. And for the stats, you can run multiple-group tests — ANOVA-style — as well as two-group tests, like t-tests, and you can choose from different statistical tests, both parametric and non-parametric. And the nice thing as well — this keeps getting missed — is that there's multiple test correction built in.
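As a quick aside on what that correction is doing, here's a small sketch — not STAMP's code, just the standard Benjamini-Hochberg FDR correction from statsmodels applied to some made-up p-values:

```python
# Illustration of multiple test correction (Benjamini-Hochberg FDR).
# The p-values are invented; imagine one per OTU or per functional category.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.008, 0.020, 0.049, 0.300, 0.750]

# method="fdr_bh" applies Benjamini-Hochberg; it returns which tests remain
# significant after correction and the corrected values (q-values).
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for p, q, sig in zip(pvals, qvals, reject):
    print(f"p = {p:.3f}  ->  q = {q:.3f}  significant after correction: {sig}")
```

Notice that in this toy example a raw p-value of 0.049, which looks "significant" on its own, no longer passes once you correct across the six tests.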
You're almost always going to apply a multiple test correction. In theory, you could get away without one if you knew for sure beforehand exactly which taxon you wanted to test and you hadn't looked at your data at all — but since we're usually profiling hundreds or thousands of OTUs and thousands of different functional groups, you have to apply a multiple test correction before reporting your p-values, or rather q-values. Okay, and after applying the test it will tell you how many features passed. Down here there's a nice little browser for your sample data — I don't use it that much, but it's really useful if you want to filter out samples going into the analysis. The other big thing I want to show you is this not-very-obvious control right here: it's showing a PCA plot, but under this drop-down you'll want to try the different visualizations — box plots, bar charts, heat maps, things like that. Just try them out; it updates the visualization. And the one thing it's not showing here is that from the stats you'll get a list of the features that passed the test — that are statistically significant — and clicking on one of those gives you an updated box plot so you can browse the data and see what's happening. Anyway, you'll get a chance to play with it today. Question: can it read the usual file formats directly? No — that's the one thing. I'll talk about Microbiome Helper in a second, but STAMP takes a plain text file that's a little different from other formats, so I have scripts that convert for you: there's a command called biom_to_stamp for OTU tables or PICRUSt predictions, and there's also a humann_to_stamp script that I wrote. Those convert the file, and then you can load it into STAMP. You could do it by hand as well, but you don't want to do that. Okay, so basically, yes, you can do these different visualizations. I just talked about the fact that it takes two inputs: the profile file, which is your OTU table or functional table, and the group file, which is the metadata file. I already said it can do PCA plots, heat maps, box plots and bar plots. One note: the PCA plot is a PCA, not a PCoA — it's not a principal coordinate analysis, so it's not like a UniFrac analysis. It's just doing a PCA on your abundances; it's not taking a distance matrix into account. For an OTU table, traditionally most people would apply a principal coordinate analysis instead — though I shouldn't say you can't use a PCA, and PCA plots are nice for seeing how the data looks — I'm just letting you know it's different from what you'd get from UniFrac followed by a PCoA. Also, STAMP will usually convert your data to relative abundance before it applies the statistical tests, and then you can choose whether to show, say, a box plot in relative abundance or, if you gave it read counts, to show those numbers instead. So you can pick the axes, but for the statistics it almost always converts to relative abundance space anyway. Question: so can you just give it raw OTU tables? Yeah, but that doesn't take care of the problem of rarefying based on the number of sequences per sample — do you know what I mean, from yesterday?
So we talked yesterday about rarefying the OTU tables so that the number of sequences per sample is the same across all samples, and there are also methods that do other types of statistical normalization of your table, in programs like DESeq2, which is often used for expression normalization. There's some debate about whether to do a DESeq2-style normalization or whether to still do rarefaction, or subsampling, of the data. For OTU tables, though, you definitely should do one of those before you put the table into STAMP. For metagenomics data it's still up in the air; most people actually don't bother, I think because the number of sequences is much higher and so they don't worry as much. But you can imagine if, for some reason, you had double or triple the number of sequences for one sample, that could bias your result as well, so I'm still a little worried about just using metagenomic read counts without taking that into account (there's a rough subsampling sketch at the end of this bit).

Okay, so I have a couple more things to discuss. I'm just trying to decide whether to do this now. Yeah, let's just push through. These are actually a couple of light things I wanted to put in somewhere, and I thought they would fit here.

Okay, so putting it all together: a metagenomics workflow. There's this thing that I made, so I lied, I am going to push my own stuff today. Microbiome Helper is this package that I wrote that's not really super novel in any way; it's basically the glue that puts together a lot of the tools we use in our lab. It's an open resource, mostly aimed at people like you, I would say, who are trying to get into the field and figure out which commands to run. There are a few parts to it. One, it combines bioinformatic methods from different groups, and we update it whenever we think there's something better out there. There's a series of scripts that help, like the biom-to-stamp conversion, other file-format conversions, or even just wrappers that make commands easier to run on a system. And then we integrate all of that, so you can download the scripts just by themselves, but we also put everything into a VirtualBox image. That includes STAMP and all the other packages, so you can run the image on any computer you have and not worry about installation problems.

Has anyone used VirtualBox before? For people who have no idea what VirtualBox is, it's just this thing that runs on your computer and lets you run another operating system within your operating system. So imagine you had a Mac and something that only runs on Windows; you could make a Windows virtual machine and run it within your system. This image is an Ubuntu image, a Linux-based system, but you can run it on a Windows machine or a Mac or anything, and it just runs within that. The downside of VirtualBox is that you lose a little bit of computing power because of the virtualization, but from an ease-of-deployment, getting-set-up-and-running standpoint, it's quite nice.

Okay, on Microbiome Helper there are tutorials and walkthroughs. The tutorials are the ones you've seen, you did one yesterday and we have two today, but there are also some standard operating procedures that we've put in there as well. I think it's fairly useful: you start from FASTQ files and go all the way through, step by step. There's the URL; that's what it looks like.
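Coming back to the rarefaction point from a moment ago, here is a minimal sketch of subsampling each sample down to an even depth with NumPy, assuming a simple features-by-samples count matrix. This is just the idea of rarefaction; it's not the exact routine that QIIME or Microbiome Helper uses, and the toy table is invented.

```python
import numpy as np

rng = np.random.default_rng(42)

def rarefy(counts, depth):
    """Subsample each sample (column) down to `depth` reads without replacement.

    counts: 2-D array of non-negative integers, features x samples.
    Samples shallower than `depth` would normally be dropped; here we raise.
    """
    counts = np.asarray(counts)
    out = np.zeros_like(counts)
    n_features = counts.shape[0]
    for j in range(counts.shape[1]):
        total = counts[:, j].sum()
        if total < depth:
            raise ValueError(f"sample {j} has only {total} reads (< {depth})")
        # Expand counts into individual reads, draw `depth` of them, re-tally
        reads = np.repeat(np.arange(n_features), counts[:, j])
        picked = rng.choice(reads, size=depth, replace=False)
        out[:, j] = np.bincount(picked, minlength=n_features)
    return out

# Invented toy table: 3 features x 3 samples with very uneven depths
table = np.array([[50, 10, 300],
                  [30, 5, 100],
                  [20, 5, 600]])
print(rarefy(table, depth=20))
```

After this every column sums to the same depth, which is the property the statistics downstream are relying on when samples were sequenced to very different depths.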
Let me just show an example of the VirtualBox image, so you can see that this is a Windows machine, and running within this window is the Ubuntu image. Within the Ubuntu image you have access to the command line; STAMP is installed, QIIME is installed, PICRUSt is installed, all that fun stuff. Okay, does that make sense? Cool. If you have recommendations for Microbiome Helper, let me know if there are things you'd want added; for people who are more bioinformatically savvy, if you think something would be useful, we're always incorporating different things.

Okay, and then the last thing, which is slightly an infomercial, but I just thought I would make people aware of it; I'm not trying to push an agenda here. I offer a sequencing and bioinformatic support service out of my lab called the Integrated Microbiome Resource. We do a lot of 16S, 18S, and other amplicon sequencing, and we've developed a pipeline where we can turn around data pretty quickly. The protocol is partly online, and as the manuscript gets put together we'll be making the whole pipeline openly available, so if you want to take our pipeline and run it on your own MiSeq, it's all laid out for you. There are some others like this, so EMP put out a protocol, but I think ours is a bit more descriptive and maybe easier to see how you could adapt it. And then on top of the sequencing platform there's this general bioinformatics workflow where we describe how we process our data.

Okay, the cool thing is it took off big time. We started in January 2015 and we're actually above 10,000 samples now, which is crazy; people are just sending us samples from everywhere. The nice thing is we've kind of ironed out all the bugs and things are running really smoothly. We've done everything from oceans to soils to different host-associated things, cool things like cheese and stuff I'd never even thought of. Originally we were mostly just doing it from Halifax, then we started getting samples from across Canada, and now we're getting them from throughout Europe, which is pretty exciting. The pricing is available on the website; it's also here. I think this is actually really useful for people who are writing grants, so if you just need to figure out how much it costs per sample, it's right there. And of course, if you're interested, you can send us samples, but that's that; I just wanted to say it's out there. Okay, any questions?