All right, so I think we'll get started with our next module. I guess everyone's had a chance to try things out on the command line, get a sense of doing some bioinformatics, got to listen to Will most of the morning, and you're on your way. This module is a little shorter than some of our other ones, mostly because it's focused on a single tool, but it fits in nicely with what we've talked about already with 16S, and it's a tool that we continue to develop, so it makes sense that we bring it up here as well. So I'm going to talk about PICRUSt, and more generally about metagenome inference. I was originally supposed to talk again tomorrow, and usually during that time I do a couple of self-plugs, but since I'm not going to be here tomorrow, Gavin, my student, is going to give that lecture, so I'll just give a quick plug about a couple of little things now. One is that I run a microbiome sequencing service out of our lab called IMR, the Integrated Microbiome Resource. We launched it in 2015, mostly to service local samples; there are lots of different places to do microbiome sequencing, and you heard about a service here this morning. We've done quite a bit of sequencing since we launched, and we've now hit 50,000 samples. Those are spread over a large number of projects and a large number of investigators, mostly in North America, but also in Europe and around the world. A lot of my research focuses on human-associated microbiomes, but at IMR we've handled lots of different sample types, so if yours isn't on here, you can send us samples and we can make a new logo just for you: liquid environments like ocean, freshwater, and wastewater, but also host-associated samples, food products, and soil environments.
That's all I'm really going to say about that, because you didn't come here for paid advertisements, although it's not a commercial company or anything; it's run right out of my lab. You can check out the website, where there are prices, a queue, and some idea about what you would submit for DNA. We tend to take DNA that's already been extracted, which is usually what we get, but we will also take samples and do the extraction for you. Coupled with that was the need to come up with standard operating procedures for the bioinformatics, and also to provide tutorials for workshops like these. So last year we published this paper called Microbiome Helper, which was meant as a resource that incorporates different aspects. One is that it has some scripts, which are basically glue that do basic file-format conversions, say between BIOM and STAMP or BIOM and HUMAnN, or that parallelize some steps. It also has a wiki where we have step-by-step tutorials outlined, and those keep getting developed over time; that's where we're hosting the PICRUSt tutorial you'll do today and the metagenomics tutorials you'll do tomorrow, and we also have some sequencing protocols on there. So if you're interested in the wet-lab side and how that's done, it's all hosted there. It does get updated, so I'd encourage you to look at it. Along with that, we still tend to develop and test different methods out there. You heard a little bit about 16S and QIIME 2 and methods like DADA2 for making amplicon sequence variants. We have a preprint that we just got revised; if you're interested, I can send you the revised version, because it's newer than the preprint that's out there right now. In it we compare three different denoising methods, DADA2, Deblur, and UNOISE, against a reference-based OTU-picking approach.
I won't go into the details, but overall they tend to give pretty similar results on beta diversity: if you look at a PCoA plot, they look pretty similar, and even sequence composition is similar. But if you're really interested in alpha diversity, the number of species in a sample, then they vastly differ, and you have to take that into account. Okay, so moving on to what I really want to talk about, which is PICRUSt. The idea here is that if you're working with, say, 16S data, you're going to do some bioinformatics, which some of you just did in your previous lab, to eventually get to an OTU table. You can see this here. Say an OTU table, or this could be a table of ASVs or sub-OTUs, whatever you like to call them, where each feature is a row and each column is a different sample. From that you could do alpha diversity, you could look at PCoA plots and beta diversity, you could test individual taxonomic differences, look at stacked bar charts or heat maps, lots of downstream things. Similarly, with metagenomics you can also use different methods to find out who is there, using different approaches to determine what taxa are in that sample; those will be covered more tomorrow morning. And of course, with metagenomics you find out not just who is there but what they're doing: you can annotate genes with particular functional classes and eventually get to a table of, say, in this case, KEGG orthologs, or they could be EC numbers. Then you can use that information to collapse them into, say, pathways, and from there you do similar things, where you want to compare the differences between functional classes and maybe talk about the different repertoires of functional abilities within that community. So that's great.
So PICRUSt essentially tries to predict this table, this functional table of gene abundances, from the information given by 16S data, say an OTU table. This was published in 2013 and has been used quite a bit, and it really wasn't updated much until the past six months to a year. Gavin, my PhD student, has been working on it, and we're starting to push out PICRUSt 2. The idea behind PICRUSt, at a basic level, is to leverage reference genome databases. We have really large genome databases, say PATRIC or NCBI or IMG, with tens of thousands of reference genomes that have been isolated and sequenced. For each of those, you can collapse them into different functional databases and count whether a particular gene is present or absent, or whether there's more than one copy of it. So these, again, are, say, KEGG orthologs: K00001 is alcohol dehydrogenase, and say in genome Y it's absent, while in genome X there are four copies of it. That's just what this is representing, collapsing that information. The idea is that you would provide, say, an OTU table where, in your community within sample one, you have ten copies of OTU1, and the question is really: can we predict what the functional repertoire would be of this OTU that's been sequenced? If the 16S were 100% identical between this OTU and genome X, you might have a good idea of the functional ability of that genome. You still wouldn't know for 100% sure, because you can have 100% identical 16S genes and still get variation in functional content, but you would at least know that it's probably a specific species at that level, and you might have a pretty good idea of what's in it.
But what happens if you don't have a good reference genome for that, and say your nearest 16S sequence is only 80% identical? Then it's a lot harder to figure out what would be in that genome, and how to collapse that information to come up with a community prediction. That's what PICRUSt is trying to do, although some methods essentially just take the nearest-neighbor approach. That's what, say, Piphillin does, which was developed after PICRUSt: it just takes the nearest neighbor and says that's what the functional content is. PICRUSt instead tries to predict it with phylogenetic methods. Okay. At this point I would usually describe more about PICRUSt 1, but now I can talk about its limitations and how we're overcoming them. With PICRUSt 1, you could only use reference-based OTUs. Will mentioned this briefly: this is where you're assigning sequences to a reference database, and anything left over, any sequence that didn't match your database, you basically had to discard; we weren't making predictions on those OTUs. It also only worked with Greengenes, because we pre-calculated most of our predictions ahead of time for those OTUs, so if you used any other taxonomic database it wouldn't work. That was not ideal, because lots of people have a preference for, say, RDP or SILVA, especially now that Greengenes hasn't been updated for a while. Also, more and more people are doing genomics on the side: they might have a particular environment they're really interested in, they're sequencing genomes specifically for that environment, and they want to put those into the PICRUSt pipeline to help enable better predictions.
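As a toy illustration of that nearest-neighbor idea, here is a minimal sketch: pick the reference genome with the highest 16S identity to your sequence and copy over its gene content wholesale. The genome names, identities, and copy numbers are all hypothetical, and this is not how Piphillin or PICRUSt is actually implemented; it just makes the contrast with phylogenetic prediction concrete.

```python
# Toy nearest-neighbor functional assignment (the Piphillin-style idea
# described above). All names and numbers are hypothetical illustrations.

genome_content = {
    "genome_X": {"K00001": 4, "K00002": 0},   # gene-family copy numbers
    "genome_Y": {"K00001": 0, "K00002": 1},
}

# 16S identity of our environmental sequence to each reference genome
identity = {"genome_X": 0.99, "genome_Y": 0.80}

# Nearest neighbor = the reference with the highest 16S identity;
# its gene content is taken as the prediction, no phylogeny involved.
nearest = max(identity, key=identity.get)
predicted_content = genome_content[nearest]

print(nearest, predicted_content)   # genome_X {'K00001': 4, 'K00002': 0}
```

At 99% identity this is a reasonable guess; at 80% identity, as mentioned above, simply copying the nearest genome becomes much less defensible, which is the gap phylogenetic methods try to fill.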
In fact, just a couple of years ago a new tool came out called CowPI. Essentially they use the PICRUSt pipeline, but they supplement it with genome information from cow rumens; they're really interested in applying PICRUSt to that particular environment, so they branded it as CowPI. PICRUSt 1 was computationally fast, but that came at the expense of flexibility, because it was all pre-calculated ahead of time. Okay, so the new pipeline looks like this, and I'm going to walk you through the different steps, but one thing to draw your attention to is that our input here is both the sequences themselves, and I'm going to use the term ASVs, for amplicon sequence variants, but these are the actual 16S sequences, and also their abundances. Before, when you used PICRUSt 1, you would essentially start at this step down here, this last portion, and all of this up here was pre-calculated. Now, as a user, you're going to run these parts of the pipeline yourself, which gives us much more flexibility in what you can provide as input to PICRUSt. So the inputs for PICRUSt are essentially a 16S FASTA file and an abundance table for those sequences, and the IDs have to match between the two; those are the only requirements. The sequences can be from OTUs, so you could use any sort of de novo or open-reference approach, and usually with an OTU approach you would pick representative sequences. If you've heard of UniFrac, those representative sequences are what get fed into UniFrac for the tree that does that calculation.
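Since the only requirement is that the FASTA IDs and the abundance-table IDs match, a quick sanity check before running the pipeline might look like this (the ASV names here are hypothetical placeholders, and the parsing of the real files is left out):

```python
# Quick sanity check of PICRUSt2-style inputs: the IDs in the 16S FASTA
# must match the feature IDs in the abundance table. IDs shown here are
# hypothetical placeholders standing in for parsed file contents.

fasta_ids = {"ASV1", "ASV2", "ASV3"}   # parsed from the FASTA headers
table_ids = {"ASV1", "ASV2", "ASV3"}   # feature rows of the abundance table

missing_from_table = fasta_ids - table_ids
missing_from_fasta = table_ids - fasta_ids
assert not missing_from_table and not missing_from_fasta, \
    "sequence IDs must match between the FASTA and the abundance table"
```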
So you can take that approach, and it's important to realize that this is all phylogenetic; there's no actual taxonomy used at any point. It doesn't matter whether you're using Greengenes or SILVA for assigning taxonomy; the pipeline only looks at the actual 16S sequences themselves and places them into a phylogenetic tree, so there's no taxonomy involved here. Or you can use some of the newer methods I mentioned before that denoise or correct 16S sequences, so in theory you have a little better resolution, down to 100% identity, and you can feed those corrected sequences into PICRUSt. Okay, so those are our inputs, in gray, and I'll highlight what this first step does. With this approach, we're taking our reference tree, so imagine this being our reference 16S tree for all genomes that have been sequenced in our database, and trying to place the reads from our environment into that context, to get a tree that looks like this. We use two existing methods. PaPaRa is a multiple-sequence-alignment method that's phylogenetically aware: it takes as a reference the multiple sequence alignment as well as the tree, and places your 16S sequences into that multiple sequence alignment. Then EPA-ng; EPA stands for evolutionary placement algorithm, and I don't know what NG stands for, do you know? Next generation. Next generation, I should have known that. It's still a preprint in 2018; EPA was published before, but the NG version, which is much faster, has just come out. The idea is that at this point we have our community sequences placed in this reference genome tree, shown in blue. Everyone follow me so far? Okay. So the next part is actually predicting the genome content for our community sequences.
For that, you can use different hidden state prediction methods. In each case, what it's looking at is, for each trait, and by trait I mean a gene family, you're going to give that to the algorithm and load those values onto the tree, so this is what the algorithm sees. Those are the numbers in red: absent, present, absent, and it can also take higher copy numbers of that trait. The method is trying to predict, for these question marks, what the trait value is. There are different methods for that: maximum parsimony, maximum likelihood, and other methods within different packages. Originally we used an R package called ape, but the castor package, published this year in 2018, really speeds up this prediction. Before, this would take weeks on a cluster, and now you can do it more or less in real time; it just takes a few hours, depending on how many sequences you have. So you can easily choose that option, but PICRUSt 2 defaults to maximum parsimony right now. The idea is that you then get predictions. This is a fairly easy example: with maximum parsimony, for this question mark you're probably going to guess a three, because both its descendants and its ancestors have three copies. This one might be a one or a zero, although it would probably be weighted more towards a zero with higher probability, and this one would probably be a two. You then repeat that for each gene family you're trying to predict. Okay. So the final part is taking those individual predictions for all the individual sequences you've given the pipeline and combining them into the community. What's nice is you can do this both to correct the OTU table for 16S copy number and to make functional predictions.
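To make the maximum-parsimony intuition concrete, here is a minimal Fitch small-parsimony sketch on a toy four-tip tree. This is not the actual PICRUSt2 implementation (which uses the castor R package and handles branch lengths, larger state spaces, and per-prediction probabilities); it just shows how an unknown tip ("?") inherits the most parsimonious trait value from its placement among labeled tips. The tree shape, tip names, and copy numbers are all invented for illustration.

```python
# Minimal Fitch small-parsimony sketch of hidden state prediction.
# NOT the actual PICRUSt2 algorithm; just illustrates inferring a trait
# value for an unlabeled tip from its position among labeled tips.
# Trees are nested (left, right) tuples; tips are strings.

STATES = {0, 1, 2, 3, 4}          # candidate gene copy numbers

def downpass(node, traits, sets):
    """Bottom-up pass: assign a candidate state set to every node."""
    if isinstance(node, str):                      # a tip
        known = traits.get(node)
        s = {known} if known is not None else set(STATES)
    else:                                          # internal node
        left = downpass(node[0], traits, sets)
        right = downpass(node[1], traits, sets)
        s = (left & right) or (left | right)       # intersect, else union
    sets[id(node)] = s
    return s

def uppass(node, sets, parent_state, out):
    """Top-down pass: resolve each node to a single state."""
    s = sets[id(node)]
    state = parent_state if parent_state in s else min(s)
    if isinstance(node, str):
        out[node] = state
    else:
        uppass(node[0], sets, state, out)
        uppass(node[1], sets, state, out)

def predict_tip_states(tree, traits):
    sets, out = {}, {}
    downpass(tree, traits, sets)
    uppass(tree, sets, None, out)
    return out

# Toy tree: (("A", "X"), ("B", "C")); "X" is the unknown community ASV.
tree = (("A", "X"), ("B", "C"))
traits = {"A": 3, "B": 3, "C": 3, "X": None}
print(predict_tip_states(tree, traits)["X"])   # 3: its relatives all have 3
```

This matches the "easy example" above: when every neighbor and ancestor carries three copies, parsimony assigns the unknown tip three copies as well.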
What you may or may not know is that many microbial genomes have more than one copy of the 16S gene. That tends to be pretty much ignored by most of the field, although a couple of other papers have acknowledged it and tried to address it in different ways. Obviously this can really bias your results: if an organism has multiple copies of the 16S gene, it will be represented two- or three-fold more in your table. So we make predictions of 16S copy number for each of your sequences and then essentially just normalize your OTU table by that number. In this case, we're simply dividing each number in the table by the copy number. Before, it was kind of annoying because you couldn't really use this table, since it was restricted to reference-based OTUs, but now there's no reason why you can't use this corrected OTU table for your downstream methods that rely on it. All right, does that make sense so far? Okay. Then the last part is essentially just doing the functional predictions. Again, this is an example where we've used KEGG predictions, but these could be EC numbers. This table was calculated in the previous step, and now you're simply multiplying the abundance of each ASV or OTU by the copy number predicted in that genome, and summing, to get your actual functional community table. So to get this value of 13, you're just multiplying each ASV's abundance by its functional copy number: two times four, plus one times one, plus two times two, adds up to 13. All right, that makes sense. Bada bing, bada boom, okay. So the major output from PICRUSt is functional tables, which could be KEGG orthologs and pathways, or EC numbers and MetaCyc pathways, and you can also remove spurious pathways via MinPath. So the major outputs are KEGG orthologs, EC numbers, or COGs.
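The two combining steps, normalizing by 16S copy number and then summing abundance times predicted gene copy number, can be sketched with toy numbers that reproduce the value of 13 worked through above. The ASV names, raw counts, and 16S copy numbers here are hypothetical illustrations, not real pipeline output.

```python
# Sketch of PICRUSt's combining step, with hypothetical toy numbers.

raw_counts = {"ASV1": 4, "ASV2": 1, "ASV3": 2}   # sample 1 read counts
copies_16s = {"ASV1": 2, "ASV2": 1, "ASV3": 1}   # predicted 16S copies

# Step 1: correct the abundance table by predicted 16S copy number,
# so multi-copy organisms are no longer over-represented.
norm = {asv: n / copies_16s[asv] for asv, n in raw_counts.items()}
# norm == {"ASV1": 2.0, "ASV2": 1.0, "ASV3": 2.0}

# Step 2: predicted copies of one gene family (e.g. one KEGG ortholog)
# in each ASV's inferred genome.
ko_copies = {"ASV1": 4, "ASV2": 1, "ASV3": 2}

# Community-level abundance of that gene family in this sample:
# 2*4 + 1*1 + 2*2 = 13, matching the worked example.
community_abundance = sum(norm[a] * ko_copies[a] for a in norm)
print(community_abundance)   # 13.0
```

Repeating step 2 for every gene family and every sample yields the full predicted functional community table.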
Then, to get to KEGG pathways, we use MinPath to remove any spurious pathways, which is a bit different from PICRUSt 1. The same goes for MetaCyc pathways, which require EC numbers to decide whether they're present or not. In addition to the functional tables, you also get links between taxa and function, so you can get stratified outputs: you not only get a count of the function, but how the different organisms in that sample contribute to that function. This is just an example for an individual KO across multiple samples. Instead of just seeing the count of that particular KEGG ortholog across these samples, you also get the taxonomic breakdown, and that's sometimes really informative. Here, say, the second sample looks pretty much like the others, but you can see that a completely different microbe is actually contributing that function. Okay, and then for evaluation, just to let you know how we evaluate it: right now we pretty much always evaluate by using 16S sequencing and shotgun sequencing from the same sample. A few different studies have done both 16S and metagenomic sequencing on the same samples, and we basically ask: if we take the 16S sequencing and do predictions with PICRUSt, how well does that correlate with the actual functional annotations from the shotgun metagenomics? In the 2013 paper there's a series of figures which I'm not going to go into in great detail, except for two; you can look at more in the paper if you like. This one just gives an overview: it's a PCoA overlaying both the PICRUSt predictions and the metagenomic data across different body sites from the HMP dataset.
You can see that the colors match up, although you do see a slight shift with the lighter colors, which are the PICRUSt predictions for each of the body sites. We also showed that as your sample has higher diversity, and especially diversity that's not well represented in reference databases, our accuracy tends to drop with that distance. This distance from the reference databases is represented by the NSTI, the Nearest Sequenced Taxon Index. This is essentially the abundance-weighted distance of all the members of the community to their nearest sequenced reference genome. What you see is that human samples have really good representation in reference genome databases, because that's what we're biased towards sequencing, so the NSTI value is fairly low; as you go out to more diverse samples, these hypersaline samples, which were known for having organisms we hadn't seen well before, are out here. And this axis is just a Spearman correlation of how well PICRUSt predicted the metagenome annotations. You can get this NSTI value along with the PICRUSt predictions for your particular sample, so you know where your sample lies on this plot. So these are just a couple of results that are new with PICRUSt 2. First, here is the null distribution. Null here just means that if you took a random distribution of genes across those genomes as predictions, this is the type of correlation you tend to get, and it's fairly high. That essentially makes sense, because there are a lot of core genes that you would expect to be the same, so even if you just randomly guess a genome, you get a pretty decent correlation by itself. This is PICRUSt 1, and these are some different things that we tried with PICRUSt 2.
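The weighted-average idea behind NSTI is simple to sketch; a minimal example with made-up abundances and branch distances (in substitutions per site) might look like this:

```python
# Sketch of the NSTI (Nearest Sequenced Taxon Index) calculation:
# the abundance-weighted mean branch distance from each community
# sequence to its nearest sequenced reference genome. All numbers
# here are made up for illustration.

abundance = {"ASV1": 10, "ASV2": 5, "ASV3": 1}
dist_to_nearest = {"ASV1": 0.02, "ASV2": 0.10, "ASV3": 0.50}

total = sum(abundance.values())                   # 16 reads in the sample
nsti = sum(abundance[a] * dist_to_nearest[a] for a in abundance) / total
print(round(nsti, 3))   # 0.075
```

A low NSTI (as in human gut samples) means the community sits close to sequenced genomes and predictions should be reliable; a high NSTI (as in the hypersaline samples) warns that predictions are extrapolating far from any reference.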
PIC is the ancestral state reconstruction, or hidden state prediction, method we used with PICRUSt 1, and this is maximum parsimony, and we do see a slight improvement between those two. Then "reference sequences" versus "all sequences": reference sequences means only those that hit the database before, and all sequences now includes all variants, which we thought would have a bigger effect, but it's fairly minimal. Either way, PICRUSt 2 is slightly better than PICRUSt 1, and the correlations are still pretty high. We also compared against a few different methods that have been published since PICRUSt. Piphillin, which I mentioned before, is the nearest-neighbor approach; Tax4Fun is another, similar approach, but it used the SILVA database; and PanFP was another method as well. All the methods actually do fairly well, so you'll see reports from them claiming pretty high accuracy, but overall we're still doing better with PICRUSt 2, shown here with maximum parsimony. Okay, and with that I'll leave you a little tidbit. People often ask, what about 18S? This is all 16S-focused, and we're not really sure about 18S for now, mostly because we don't have nearly as many reference genomes for microbial eukaryotes as we do for bacteria, so we're pretty tentative on this. But interestingly enough, you can plug an 18S tree and 18S genomes into the pipeline. This is just a leave-one-out validation, where we hold out the genome that we're trying to predict, make a prediction for it, and see how well it correlates. The correlation is really good compared to the null, but this should still be taken with a grain of salt; we haven't really tested it much beyond that. If you sample a completely new microbial eukaryote that hasn't been sequenced before, it could do much worse.
But we are trying to validate this on a community dataset where we have paired 18S and metagenomics that's mostly microbial eukaryotes. Okay, and with that I'd like to present my amazing new acknowledgement slide, which was just generated last night, actually, yeah, just last night. So this is an amazing PCoA plot representing members of my lab, whose samples just finished sequencing yesterday. This is how we place into the PCoA plot based on stool samples. If you come to my lab, you're promised a free sample; your first one's free. You too can share in the data plot. I won't reveal any more details beyond that; it's just a PCoA plot, we won't get into how much Akkermansia everyone has. So these are the members of my lab, and I should really point out that Gavin Douglas, who is the TA here and is actually going to give a lecture tomorrow, has been the primary developer of PICRUSt 2, so all acknowledgements for that development go to him. He's done a great job. And obviously funding on top of that. So I'd love to take questions about PICRUSt 2, because I actually kept within my time, before we move to the lab. Thoughts? Yep. I've seen papers in the last few years looking at these predictions. Sure. And clinical outcomes. Right, so I mean, I think that's what people want to use PICRUSt for, essentially, right? My default answer on that is two-sided. PICRUSt has this love-hate relationship, where I think people love to use it because, if you have 16S data, it gives you some idea about what functions may be in those organisms, right? And the hate comes from the other side, where they say it's still a prediction, and people tend to get caught up in overstating their confidence in it, right?
Sometimes I have slides with a nice example, which I won't go into in great detail, but anyway, a paper got bashed online because it was based on PICRUSt predictions, and it got overblown into saying that a particular metabolite was produced, which has been associated with a disease, and so it got blown out of proportion, with oral health and, I believe, cardiac arrest, and then coming up with a magical probiotic mouthwash that could solve everything. But I mean, it's the same with metagenomics, right? If you do metagenomics, you get gene abundances, and that still doesn't mean that, one, those functions are doing what you think they're doing, because the annotation could be wrong; two, they're not necessarily transcribed, unless you're doing metatranscriptomics, which John's going to talk about; and three, you don't know the metabolites are produced either. So even a metagenome is not a gold standard that's completely linked to the clinical outcome. You can tell there are a lot of caveats around what I'm saying, but I guess the idea is, yes, it gives a prediction that's fairly robust, and the idea is that it's hypothesis-generating, right? You would then follow it up with something more targeted for the function you're interested in, downstream. Does that make sense? Okay, yeah, question here. Thanks for sharing, really interesting. I have a question about the source data, as well as potentially accounting in the downstream analysis for uncertainty or variability in the phylogenetic prediction. Yeah, so I can repeat the question. She's getting at the point that certain functions are probably better predicted, because they're better conserved along the 16S phylogeny, versus others which may not be as reliable. That's definitely true.
When we looked at functional categories in PICRUSt 1, we did see a definite change across certain classes of functions, although they were all pretty high. KEGG doesn't actually focus much on, say, antibiotic resistance genes, which a lot of people are interested in, and which are notoriously associated with lateral gene transfer (LGT), right? So that would be a class where we would presume we do a bit worse, and from a long time ago I did have an indication that those results were a little less confident. With PICRUSt 2, we actually get probabilities for each gene family for each prediction, and we're trying to figure out the best way for the end user to use that at this point. If you dig into it, we can provide the probabilities for a particular function being in a genome. We're hoping, for one, to have maybe some cutoff to say, no, you shouldn't really use this function at all. And on the other side, some ASVs may just be really poorly predicted, because they get placed wrongly in the tree or are super divergent, so we're looking at ways to say, no, we can't make a prediction on these, so they shouldn't be incorporated. With PICRUSt 1 we basically just gave it all to the user and left it to them, so we're trying to find a few ways to give warnings that make sense. Yep. So you mentioned AMR, and I had a question: can you predict AMR, or something more like the virulence of a particular strain? Yeah. It's something we're interested in, to know which strains are more virulent versus not. Right. So I'm a little torn on that answer. My PhD days would say no. As a microbiologist, I think I would say no, not very reliably, because I think people understand really well that virulence can easily be dropped, right? I mean, you have E. coli strains that are notorious for being super virulent, or just a lab strain, as a typical example.
But I spent my PhD days on LGT, where you would see genomic islands coming into particular genomes and conferring new virulence, and it's the same, you know, it's just another strain, so that's not detected by PICRUSt. So I would be wary of that. PICRUSt 2 doesn't make, say, a prediction on virulence. That being said, there is a method called BugBase, which essentially uses PICRUSt in the background to make predictions about gram-positive, gram-negative, and, I believe, virulence, either as virulence factors or something around that, and they show all right accuracy. So I think it depends what you're asking. It might work at a community scale; I mean, people even associate just a larger amount of Proteobacteria with being bad, because they tend to include a lot more of the pathogens, right? If that's the sort of scale you're going for, then yeah, PICRUSt would actually give you that information quantitatively, but is it going to say for sure whether a strain is virulent or not? Definitely, definitely no. I think association of the genes, at least the categories that these genes could be involved in? Yeah, you can, absolutely, you can definitely get those predictions. It's just, I think, one, it's a category that's really interesting, and two, one that we'll probably have to look at robustly with PICRUSt 2 to make a better comment on; it's something we haven't looked at recently, to check the correlation and say how well it works. But if, for example, we have a database of virulence, and there are many, like PATRIC has its own specialized virulence factor database, can we take that set of genes and put them in the pipeline? Yeah, so all you would need, and you'll see this in the module, is to provide the pipeline with this file here, the trait values of the reference sequences, along with their placement in a tree, right?
The methods are there to make that tree and to provide that, and then you would just run the pipeline the same way. And when is this coming out? So PICRUSt 2 is definitely publicly available; you're going to use it in the module today. We're technically calling it a beta. We're pretty confident in its accuracy overall; I know Gavin's still making a few tweaks, maybe to the reference tree, just because there are some long branches there that you might see in the tutorial, but we're pretty confident in its predictions right now, and it's just more flexible at this point. So yeah, it's available now, and we're hoping to put a paper out maybe late this summer. So with the prediction, is there a persistent type of error that PICRUSt makes? Consistent error? Yeah. Well, in which way? So the accuracy is around 85 to 90% right now; where is that error being made? Oh, that's a good question. In theory it's being made by certain ASVs or sequences where we're not making good predictions on their genomes, and that's definitely true, and that's where I would say most of the error is. The thing is, our gold standard in this case is a metagenome that's been functionally annotated, right? We're not doing an in silico approximation, which we could do, but there are problems with that as well. Our gold standard is a metagenome that's been functionally annotated with HUMAnN2, and there are lots of different ways to do that. So achieving 100% accuracy is just never going to happen, even in the perfect case, because there are just differences, right? There's error coming from that, and from the sequencing depth of the metagenome we're comparing to.
As I mentioned before, we can get probabilities on the individual traits, and we can try to model that and provide it to the user in some fashion too, but we're still working on that a little, because it's a probability of each trait for each ASV within your community, and then you have to combine them. Yeah, one more and then we'll move to the module, yep. So what is the sensitivity of detection? What is the lowest number of reads that should be there for the prediction? For 16S? Well, in our first paper it was actually really low. We did a rarefaction, and I believe the cutoff was 100 before we saw a drop in accuracy. I wouldn't want to go that low nowadays; I mean, I don't think anyone goes that low, right, with 100 sequences. Most people have 10,000 or more, so I don't think that's a big factor with PICRUSt at all. On the other side, our gold standard would be better if we had really high-depth metagenomes, because you require a lot more reads to get that depth. Okay, so we'll cut off questions now and move to the lab.