So I'm going to talk to you about PICRUSt for the next half hour, and then I have a lab that's about an hour as well. I don't think it'll take too much time for both of those things. My slides are very short, so I'd encourage you to ask questions throughout. Usually there are lots of things to ask, and the good details usually come out from a question. So definitely stop me whenever you want. Okay, so learning objectives: basically I'm going to talk a little about how we do qualitative inference of function from taxonomy all the time. I'll explain what I mean by that. Then I'm going to talk about PICRUSt, which is a method to predict function from 16S data. Then I'm going to talk really briefly about some limitations of PICRUSt and about the major steps of a PICRUSt analysis, which will lead into the tutorial. Okay, so far today you've talked about the 16S ribosomal RNA gene, and you've probably all got to the step where you can make an OTU table from 16S data. You could use QIIME or mothur to get to this table. An OTU table is really just a simple table where you usually have samples as columns and OTUs as rows. And then from there you do lots of fun things. You can build PCA plots, or make networks, or use a cool tool that Rob Beiko may have shown you, whatever you're going to do. There's lots of visualization, but that OTU table is often the starting point for it. Okay, so that's a little recap. So, what's in a name? The real question is this: you can do this clustering of sequences into OTUs, right, usually at 97% identity, and then the second step is to assign taxonomic names to those clusters. So why would we want to do that? What do you guys think?
Why don't we just call them one, two, three, four, or come up with some simple scheme that says this cluster is OTU 1234? Why would we want to give it a real name? Trick question. Sounds tricky, right? Any ideas in the back? For biologists? Okay. And why would biologists want to name it? Right, so biologists say, no, we can't use OTU names, we'll use Latin names. Yeah, any other ideas? Right here, please. Because there's function implied in the name. Oh, you're cheating. Yes, that's right, right? Because I'm talking about function. There's function in the name. No, I'm just joking. Yes. Right. So when we say a name, we don't just know what that name is; we associate other things with it, right? So if we saw Haloferax pop up, and you're really into bacteria, or extremophiles, you would think, oh, those must be cool archaea, and I know they live in really salty environments, salt marshes or something. If you saw Prochlorococcus pop up in the gut, you'd be like, oh, what are they doing there? That's kind of weird. They must have had some lettuce or something, because those are photosynthetic, and that's kind of odd. And if you saw, maybe, Bacillus, you might think of spores, and spores are cool, and spores are bad, and all that fun stuff. And there are new associations coming out as well. If you've ever looked at the obese-versus-lean microbiome research, Akkermansia is a genus that keeps popping up that might be really cool, and if you take it, maybe you lose some serious weight after you eat McDonald's food. So all of these names have things associated with them, and it comes from our research, right? We've associated certain things with them, especially with pathogens. And from that you can do cool things. So there was this neat little paper in 2012. I've used this paper for quite a few years. I'm not on the paper, but it's funny.
I used this slide, and then I ended up hiring the first author like a year later, and I was like, that's where I know your name from. It was so cool. Anyway, in this paper they only did 16S profiling; they didn't do shotgun metagenomics. So they looked and made associations: we know this taxon is probably a nitrogen fixer, or we know these types of bacteria grow in oxic or anoxic conditions. And then they used that information to track their different samples over depth. It's kind of like a poor man's metagenome for functions, right? You make these associations and you discuss them. And if you look at almost any paper doing 16S, they'll talk about, ooh, we saw an increase in this bacterium or a decrease in that one, and then they'll say, and this other paper showed that this bacterium was doing something else, right? So you associate function with these guys, but it's very qualitative; it's not quantitative in any way. Tomorrow I'm going to talk to you a lot more about metagenomics and about doing taxonomic assignment and functional assignment. But I just want to point out right now that when we do functional assignment tomorrow, you'll eventually get to a table really similar to an OTU table. You'll have your columns of samples, and then you'll have some functional category on the other axis. I'll go into more detail tomorrow. These are examples of KEGG orthologs: groups of genes with some sort of functional association. And then you do lots of similar visualizations; this is supposed to represent all that fun stuff you do afterwards, the statistical analysis. And what PICRUSt does is go directly from your OTU table to give you your functional table, without having to do the metagenomic sequencing. So that sounds great, right? That sounds like a win-win situation. So PICRUSt is an acronym; I'm not going to read out what it stands for.
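Both tables I keep pointing at have the same shape: samples as columns and features, either OTUs or KEGG orthologs, as rows. A minimal sketch of that structure, with completely invented IDs and counts:

```python
# Toy feature table: rows are OTUs (or KOs), columns are samples.
# All IDs and counts are made up for illustration.
samples = ["SampleA", "SampleB", "SampleC"]
otu_table = {
    "OTU_1": [10, 0, 3],   # counts of OTU_1 in each sample
    "OTU_2": [5, 12, 0],
    "OTU_3": [0, 7, 21],
}

def total_counts_per_sample(table, n_samples):
    """Column sums: total counts observed in each sample."""
    totals = [0] * n_samples
    for counts in table.values():
        for i, c in enumerate(counts):
            totals[i] += c
    return totals

print(total_counts_per_sample(otu_table, len(samples)))  # [15, 19, 24]
```

A functional table from PICRUSt looks exactly the same, just with KEGG ortholog IDs in place of OTU IDs.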
So it was a paper we published in Nature Biotechnology in 2013. It was a collaboration between three major groups: I was in Rob Beiko's lab as a postdoc at the time, and Curtis Huttenhower's lab was involved, along with Rob Knight's, and lots of collaborators. Jesse Zaneveld and I were the two co-first authors on the paper. There's a site just for PICRUSt, and there's some documentation. Okay, so let's talk about how PICRUSt actually works. You can imagine we have our 16S ribosomal RNA tree. Each tip in this tree, and this is a smaller tree, represents a single sequence or OTU; it doesn't really matter in this example, but it's a thing. And say we have a sampled 16S sequence, so in our sampling we find that we've got this guy here out of the big tree. That tree has about 200,000 tips now, if I had to guess; some databases are bigger or smaller. So imagine you knew nothing about this guy. Actually, I'll just look at my next slide: we're going to zoom in on this section of the tree. Imagine you had a particular thing of interest. Actually, let me back up one second. In the simplest case, say we had the genome for this guy right here, and the 16S in our sample was 100% identical to it. Then we might say, ooh, we know this genome has this ortholog or this function and can do these different things, and we can get an estimate of what's associated with that sequence. And then you can imagine that if we didn't have this genome, but we had this other genome over here, well, that's pretty close. There's a little bit of distance here, and in some magical way maybe we could just take that genome as a proxy. We'll just use its nearest neighbor. So PICRUSt uses that idea and extends it a bit more.
So in this toy example, we have a few different things going on. This is for a single gene, say a KEGG ortholog, that we're looking at, and the number at each tip represents the copy number of that gene in that genome. So imagine your favorite gene, whatever it is. This genome has one copy; this genome has one. This guy has never been sequenced, so we're not sure about it. And this is our new sequence, sorry, not genome, that we have in a sample but don't have a genome for. So what PICRUSt uses is an ancestral state reconstruction. There's a whole series of cool phylogenetic comparative tools out there that focus not on building a tree, but on using a tree to do other cool things. An ancestral state reconstruction basically tries to infer, at the different ancestral nodes in the tree, from the trait information at the tips, what the most likely value of the trait was at each of those ancestors. Okay, so from this information, the ancestral state reconstruction would give you a value here and a value here; it might say maybe this one is one, maybe this one is three, something like that. And then we take the average of those neighboring predictions, weighted by where our tip branches off, to make the prediction for our new tip. And based on those branch lengths we also attach a sort of confidence to the prediction. So the whole idea is that we're using the information from the whole tree about that particular gene, and then making an inference about this guy. Okay, so that's for one function and for one tip in the tree. We basically compute that, and then we repeat it for all the functions we're interested in. We usually use KEGG orthologs, but we could use pretty much anything. And for KEGG orthologs there are around 8,000 functions; that number has gone up since, maybe around 10,000 now.
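To make that idea concrete, here's a toy sketch. This is not PICRUSt's actual ancestral state reconstruction; it's the simpler, related nearest-neighbour idea I mentioned, predicting an unknown tip from its sequenced relatives, weighted so that closer relatives (shorter branch-length distance) count for more. All names and numbers are invented:

```python
# Toy sketch of tree-based trait prediction. NOT the real PICRUSt algorithm
# (which uses ancestral state reconstruction) -- just an inverse-distance
# weighted average over tips with known trait values.

def predict_trait(known, distances, query):
    """Inverse-distance weighted average of known tip trait values.

    known:     {tip_name: trait_value} for sequenced genomes
    distances: {(tip_a, tip_b): branch-length distance}
    query:     name of the unsequenced tip to predict
    """
    num = 0.0
    den = 0.0
    for tip, value in known.items():
        d = distances[(query, tip)]
        w = 1.0 / d          # closer tips get higher weight
        num += w * value
        den += w
    return num / den

# Known gene copy numbers at sequenced tips of the tree:
known = {"genomeA": 1, "genomeB": 3}
# Branch-length distances from our new 16S sequence to those tips:
distances = {("new_seq", "genomeA"): 0.02, ("new_seq", "genomeB"): 0.06}

print(predict_trait(known, distances, "new_seq"))  # closer to genomeA's value
```

The real method reconstructs trait values at internal nodes of the whole tree first, but the intuition is the same: nearby, well-characterized relatives dominate the prediction.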
It doesn't matter; we repeat that over and over again for that one tip. And then we also pre-compute it for all the tips in the tree. So for all the Greengenes OTUs at 97%, we pre-compute that, and it gives us a profile for every single tip in the tree of what we think its genome might look like, with those values. Does that make sense? Sort of? What are KEGG orthologs? KEGG, yeah, we're going to talk about KEGG tomorrow. KEGG is a functional database. I'll talk about it in a lot more detail tomorrow, but it's just a way to annotate genes. I think that's a good holdover till tomorrow. Is that okay? Okay. So PICRUSt does two major things. One is, who's thought about OTU tables and the fact that some genomes have more than one copy of the 16S gene? Who knew that? So, yeah, a few people. So genomes always have at least one copy, but sometimes they'll have multiple copies. You can imagine that if you sequenced a community and you had a genome in there with four copies, then that genome is going to be overrepresented four times relative to a single-copy genome. Sounds like a really bad idea to ignore that, right? But most people don't care; it's kind of weird. Most people make OTU tables and don't worry about 16S copy number variation at all. It's just one of those things; it's really weird. So there was a paper that came out right before PICRUSt showing that it probably is important to correct for this, and they came up with a way to do it. And when we were doing PICRUSt, we said, oh, we could probably predict 16S copy number the same way we predict all our other functions. So we actually estimate 16S copy number, and then we let you normalize your OTU table to take into account that problem of multiple copies of the 16S gene. So in that first step, basically, we have our OTU table, and we have our predictions here for every OTU ID: these are predicted 16S copy numbers for that genome.
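That normalization step is just an element-wise division: each OTU's counts get divided by that OTU's predicted 16S copy number. A minimal sketch with made-up numbers:

```python
# Toy sketch of 16S copy-number normalization: divide each OTU's counts
# by its predicted 16S copy number. IDs and values are invented.
otu_table = {
    "OTU_1": [8, 4],   # counts in Sample1, Sample2
    "OTU_2": [3, 9],
}
predicted_copies = {"OTU_1": 4.0, "OTU_2": 1.0}  # predicted 16S copies per genome

def normalize_by_copy_number(table, copies):
    """Return a new table with counts divided by predicted copy number."""
    return {
        otu: [count / copies[otu] for count in counts]
        for otu, counts in table.items()
    }

print(normalize_by_copy_number(otu_table, predicted_copies))
# OTU_1 carries 4 copies, so its apparent abundance is divided by 4
```

So an OTU that looked four times as abundant purely because its genome carries four 16S copies gets scaled back down to its real organism abundance.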
Do you account for variation between the copies of the 16S within a genome, or do you just assume they're all the same? Within one OTU? Yeah. Or in the genome? In the genome, one organism. So within the organism, there's a finite... let me see. So there will be some number of 16S genes in a genome, and that might change between strains or species. Is that what you mean? What I mean is, for example, in one little guy, maybe its four copies are different from each other because of mutations between the copies. Oh, I see, differences in the actual 16S gene sequence? Yes. Okay. So that's another problem, but we don't try to fix that here; that's actually a problem separate from 16S copy number. The problem there is that for some genomes you have multiple 16S copies, and usually when you have multiple copies they're really similar to each other, 99 to 100 percent identical, but for some organisms, like Haloferax, they can actually be quite divergent, like 93 to 95 percent. And you can imagine that if you're trying to bin these things at 97 percent, that's kind of weird, right? You have some of the sequences going to this OTU and some going to that OTU, and they're from exactly the same organism. That's a whole other problem that we're not trying to deal with. But yeah, good stuff; a future tool to make. Anyway, the whole idea is that we have a 16S copy number here and we're basically just normalizing by it. So, really simply, we just divide the values in our OTU table by the numbers shown in red, and that gives us our normalized OTU table. Does that make sense? Got it. Okay. So, a quick question. Yeah. Obviously it makes sense here, but then... Yeah, that's a good point. I don't know why people don't do it. Okay, I'll come back to it in the limitations a little later, but I'll address it now.
So one of the limitations right now with PICRUSt is that, one, we're stuck on the Greengenes database, and the second major thing is that, notice these have actual identifiers, these are Greengenes identifiers. We just can't use de novo OTUs, because we don't have any way to figure out where they sit in the Greengenes tree. So one, we're stuck on Greengenes, and two, we're stuck on only reference-picked OTUs. Most people would agree that's a bad thing now, although when we published it, people were fine with reference OTU picking. So the problem now is that if you use our method to normalize, you'd be throwing away your de novo OTUs, which is probably worse than not normalizing at all. The other approach, though, by Steve Kembel, I believe, works on de novo OTUs as well. And PICRUSt will eventually, if you want to wait four to six months, but right now it's stuck here. But yeah, I believe people should definitely do 16S copy number normalization. I mean, the argument could be made that it relies on predictions of the 16S copy number, but I would definitely bet, and Steve Kembel shows really clearly in his paper, and we didn't really focus on it in ours, that using predicted copy number is better than just ignoring the copy number problem altogether. So... Sorry, I didn't catch the whole question. Where the known copy numbers come from? The genomes, yes. That was the tree thing I just showed. The whole idea is that you know certain tips in the tree have certain copy numbers, so you try to predict, for the ancestors, how many copies they probably had, and then you make a prediction for your tip based on that. All right? Great. Okay, so the second step is where the real magic happens. This is where it gets interesting. Now we take our normalized OTU table, and we take our predictions, so now we have every single KEGG ortholog, or whatever gene cluster or function, as a column, and we do a sort of matrix multiplication.
So in this example, to get this value, the result is that we now have this table of KEGG functions by samples, and say we have this value of 13, right? That just comes from multiplying each prediction by the abundance of the corresponding OTU: two times four is eight, one times one is one, and this one gives four, so eight plus one plus four is 13, right? And you're basically doing that as a matrix multiplication across all the columns and rows here, so you get a whole matrix like this. Does that make sense? I mean, the real hard part is coming up with the predictions in the first place, which I just explained. That's all pre-calculated, so when you're actually running the steps, it's really only doing this multiplication, and the predictions are just a lookup table. And that's why we require the Greengenes IDs, because that's how you map your OTUs to your predictions. Does that make sense? Yeah. Okay. So that's great; that's how it works. Then I thought I'd show some of the major highlights from the paper, of how we validated it. The whole idea was that we took the HMP and some other datasets, did 16S analysis on them, and then ran PICRUSt. So we generated these really big tables, which are represented here by the heat map, with different functions and different samples. Then, some of the samples in the HMP had both metagenomic sequencing and 16S sequencing. So we took those same samples and annotated the metagenomes, and I'll talk tomorrow about how to do functional annotation with metagenomics. And then we asked: how well do those overlap? How well do the PICRUSt predictions overlap with the metagenomic annotations, and what's the accuracy? So that's the idea.
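The multiplication just described can be sketched in a couple of lines. The numbers below are chosen to reproduce the 8 + 1 + 4 = 13 example from the slide; they're illustrative, not real data:

```python
# Toy sketch of the metagenome prediction step: function abundance =
# sum over OTUs of (OTU abundance) x (predicted gene copy number).
# One KO and one sample; numbers chosen to reproduce 2*4 + 1*1 + 1*4 = 13.
otu_abundances = {"OTU_1": 4, "OTU_2": 1, "OTU_3": 4}   # one sample's column
ko_copy_number = {"OTU_1": 2, "OTU_2": 1, "OTU_3": 1}   # predicted copies of one KO

def predict_function_abundance(abundances, copies):
    """Dot product of OTU abundances with predicted gene copy numbers."""
    return sum(abundances[otu] * copies[otu] for otu in abundances)

print(predict_function_abundance(otu_abundances, ko_copy_number))  # 13
```

Repeating this for every function and every sample is exactly the matrix multiplication of the normalized OTU table by the precomputed prediction table.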
So what we found is that when we look at, say, a PCA plot, this is just body sites, you can see that the colors of the dots are the different sample sites, and all the triangles are the PICRUSt predictions and all the circles are the actual shotgun sequencing. So we're using the real metagenomics as our gold standard. And what we see is pretty good overlap, for body sites anyway: the predictions cluster with their appropriate body site. That's great. We also looked specifically at accuracy. So if you look at the correlation, we looked at different datasets, and for the human datasets we get an average of about 0.9 correlation with the metagenomes. And on the x-axis, what's interesting here, each color is a different type of dataset, and what we plotted is the distance from each 16S sequence to its nearest reference genome. The idea is that as you go up this axis, your sample is more diverse and not well represented by genomes in the database. So for the human gut samples, where we have really good reference genomes, we do really well. The soils sit in between. And then if you take a really bizarre sort of community, these are hypersaline mat samples, which are really poorly represented by reference genomes, we plot them up here and we do see our accuracy drop. So if you have some really crazy, diverse community that no one's ever sequenced before, we can't predict it very well. Sorry, a question: how do you think it would perform on other populations, like kids in North America, or some Amazon jungle tribe? I mean, I guess that comes back to what we just discussed, whether there's a difference there. I'm guessing a lot of people have used it on those kinds of samples. Right.
So it tends to actually work on these crazy sample types. At first we didn't really recommend it too much for soil, although soil did surprisingly well here in the test. But what we do output is this NSTI value, the nearest sequenced taxon index. The nice thing is that when you run your samples, you get this NSTI value for each one. It tells you, based on the reference genomes tied into the Greengenes tree, how well represented your sample is: as the NSTI value increases, your sample is more diverse and less well represented by reference genomes. So some people use this as a sort of quality check, but it's probably not a perfect correlation. You could have an NSTI value of, say, 0.2, and then does that mean you're getting a soil-like 0.9 correlation, or are you getting, you know, 0.4? There's a lot of variance around that. So it's a rough guide, right? Do you have any thoughts on that? Could you use it as a cutoff? Well, yes, in theory, though remember this is accuracy on the y-axis. You run all your samples and you get a whole bunch of NSTI values, and then you could say, well, okay, I got an NSTI value of 0.3, that's off the scale here, we really shouldn't trust that prediction at all, whereas this one over here, we should trust it more. Okay. Anything else on that? Okay, so then we looked at per-genome predictions. So we'd leave one genome out, try to make the prediction for that genome, and then compare to measure the accuracy. We did that analysis, and we found that you actually see more variation around the tree. This bar on the outside is the accuracy for each particular species we tested. We thought we'd see certain branches, say the archaea for sure, where the accuracy would drop down more. And surprisingly, it's pretty consistent. You do see dips around areas where you have these longer branches.
And then you do see the accuracy drop down there. But we didn't see what we maybe expected, areas of the tree that are just not well represented showing a sharp decrease. So that was interesting. And then the last thing on validation: we asked, well, what types of functions do we do better or worse at predicting? This is at a sort of high level, looking at KEGG pathways, so grouping those KEGG orthologs into more general functions. Overall, we see that the accuracy is still pretty high. You do see some variation, not weird, just different, so some categories are a little bit lower, but we're pretty much in the same range across functions. So it didn't seem to matter too much which functions they were. Now, KEGG doesn't have a great representation of things like mobile genetic elements or, say, antibiotic resistance, unfortunately. So we also used another functional annotation system called SEED, and with that we did see a drop in prediction accuracy for things that are known to transfer horizontally, which you'd expect, since 16S isn't going to do a very good job of predicting those, or for things that change really rapidly within strains, like, say, virulence factors. Yeah. For the functions, the function comparison was, yes, this is only HMP data; I'm almost 100% sure on that one. I don't think we looked at the functions from the other samples; I think we stuck with the human samples for that. I can check, though. Yeah. Okay, this is kind of a promo from when we were first coming out with it, but one of the cool things we did was take the whole HMP, which at the time had around 6,400 16S samples and only this many metagenomes, and we basically ran PICRUSt on it, and it only took 10 minutes, and suddenly we had all this extra information.
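Going back to the NSTI value from a moment ago: my understanding is that it's an abundance-weighted average of each OTU's branch-length distance to its nearest sequenced genome, so abundant, poorly-represented OTUs push it up. A toy sketch with invented numbers:

```python
# Toy sketch of the NSTI (nearest sequenced taxon index) calculation:
# abundance-weighted mean distance from each OTU to its nearest
# sequenced genome. All IDs and numbers are invented.
otu_abundance = {"OTU_1": 50, "OTU_2": 30, "OTU_3": 20}
dist_to_nearest_genome = {"OTU_1": 0.01, "OTU_2": 0.05, "OTU_3": 0.20}

def nsti(abundance, distance):
    """Abundance-weighted mean distance to the nearest sequenced genome."""
    total = sum(abundance.values())
    return sum(abundance[o] * distance[o] for o in abundance) / total

print(nsti(otu_abundance, dist_to_nearest_genome))
# Higher NSTI -> community less well represented by reference genomes
```

So a gut sample full of well-sequenced taxa gets a low NSTI, while a hypersaline mat full of distant, unsequenced lineages gets a high one, which is the signal people use to decide how far to trust the predictions.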
So in the paper you can look, and we did some analyses that pulled out some biologically relevant details, but I won't get into those. The other thing we looked at, which isn't super relevant but I think is kind of interesting: since we use metagenomics as our gold standard, you can imagine that if you don't do very deep metagenomic sequencing, the metagenome itself is incomplete. So the red line shows what happens as we rarefy the metagenomics: as you rarefy, the accuracy of the metagenome compared against its own full depth goes down. And it's kind of cool that the 16S-based PICRUSt accuracy stays basically flat. And where the lines cross, it means PICRUSt is doing better than the shallow metagenome. Now, this doesn't apply to many datasets today, because of course we're doing way more than 72,000 reads per metagenome. But funnily enough, in 2013 there were still quite a few projects in MG-RAST that actually had less sequencing than that per metagenome sample, which suggests those people could have just run PICRUSt on their 16S data and gotten closer to the real values; that was something like 16% of the existing metagenomic projects, which is pretty cool. Okay, so I mentioned a couple of limitations earlier, and I'm going to readdress them. These aren't in your handouts, because I realized I didn't put them in. So I'll re-mention them briefly. PICRUSt is currently based only on reference-picked OTUs. So if you do open-reference OTU picking in QIIME, you can still use PICRUSt, but you basically have to filter out all your de novo OTUs, right? And it's tied to the Greengenes database, so if you're using SILVA or RDP, tough luck. I know. And it only works for 16S data right now. So people have asked, what if I have 18S or ITS, could we get some functional predictions?
And people have asked about that since the start, and I thought it was crazy a while ago, but I think we might give it a shot, maybe for things like yeast or something. But really quickly, and this is more to do with functional databases: you'll get these weird pathways coming out of KEGG, where it's predicted that these genes are associated with some disease like cancer or something, which makes absolutely no sense. And there's a kind of weird technical reason for it. It has to do with the fact that KEGG orthologs are sometimes annotated with both human gene and microbial gene annotations. So a few things can happen. One is you could just get distant homology, so your gene gets annotated, sort of poorly, to some gene that's mostly found in humans and is only, say, 50% identical. That's a bad example, but that sometimes happens. But more likely is that genes that do have homologs in people get associated with disease in people, but those genes do something else in microbes. Long story short: don't try to say that the microbes are causing disease or some sort of cancer or whatever those annotations say, because it's just an artifact of the database itself. Okay. So those are the main limitations. Sorry, on that note, though, can you tell which annotations are actually the problematic ones? I don't know that, but I don't understand why it's there. Well, yeah, that would actually be a really useful database, but that's not what's going on. There are different ways to group the KEGG orthologs, and one of those groupings, at the highest level, is called human diseases or something, something very general like that.
So if you look at your pathways, you can basically just remove all of those; those proteins are disease-related in people and have nothing to do with the microbes. Can you repeat that? So you're saying at the highest level... Yeah, you can just throw those pathways out. Yeah, I should just remove them myself, I know. Again, it's on the to-do list. Okay, so the major tutorial pipeline: you're starting with an OTU table that's already been made, or you can get to one through open-reference OTU picking and then filtering out the de novo OTUs; there's an explanation somewhere that walks through that. mothur has a tool as well that lets you make a BIOM file with Greengenes IDs; it's a little tricky, but you can get there. But the tutorial focuses mostly on this part: you're going to take that OTU table, correct the 16S copy number, get the gene predictions, collapse those into KEGG pathways, and then you can push all of that into a tool called STAMP. Okay, so tomorrow you have me all day, so get used to it, but tomorrow I'm going to discuss what STAMP is in the general introduction, and I'm going to talk about how these functions are grouped into pathways.
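That collapse-into-pathways step is just summing the abundances of all KEGG orthologs mapped to the same pathway. A minimal sketch; the KO IDs, pathway names, and mapping below are invented (the real mapping comes from KEGG, and note that a single KO can belong to more than one pathway):

```python
# Toy sketch of collapsing KEGG orthologs (KOs) into pathways: sum the
# abundances of all KOs mapped to each pathway. IDs, names, and the
# mapping are invented for illustration.
ko_abundance = {"K00001": 13.0, "K00002": 4.0, "K00003": 7.0}
ko_to_pathways = {
    "K00001": ["Glycolysis"],
    "K00002": ["Glycolysis", "Butanoate metabolism"],  # KO in two pathways
    "K00003": ["Butanoate metabolism"],
}

def collapse_to_pathways(abundance, mapping):
    """Sum KO abundances into every pathway each KO maps to."""
    pathways = {}
    for ko, value in abundance.items():
        for pw in mapping.get(ko, []):
            pathways[pw] = pathways.get(pw, 0.0) + value
    return pathways

print(collapse_to_pathways(ko_abundance, ko_to_pathways))
```

The resulting pathway-by-sample table is the kind of thing you'd then load into STAMP for the statistical comparisons.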
The tutorial today is mostly meant to give you an idea of what's going on before tomorrow, and it has two major sections: one is this part, and then there's an optional part where you can do some stuff in STAMP if you want. If you're just blazing through, sure, give it a whirl, but it would probably be a lot better to leave the STAMP part till tomorrow, when you'll have some extra time. Okay, so I think for today, just focus on the first steps, up to where you actually make the OTU tables and open them up. I have a question: this table here, how do you get those numbers? Yeah, I've never explained this well; it's really tricky. Okay, so this is the table we're really trying to produce, and this comes back to that tree slide, so maybe going back in time helps. To get that table, we basically say, for one particular KEGG ortholog, say, we're trying to predict its copy number across all these different OTUs. So what we're actually addressing is this number right here, this one: we're trying to figure out how many copies of this gene are in this particular OTU. Okay, so that's what I'm trying to show in this tree. Basically, we're using all the other information, all the other genomes placed on the tree, to fill in this hole in the table. And to do that, we map reference genomes onto the tree, right? There are big databases of reference genomes that people have already sequenced, and when we go into those genomes, we can say, oh, well, this genome has four copies and this genome has one. So that's known information, and then we're just using a method to predict what we think is in here. In the simplest case, you could just ask, what's the nearest neighbor? We actually did that, and you do pretty well without doing all the ancestral state reconstruction; you could just say, oh, this one's closest, use its value. Yeah, you had to explain that; I just thought
it was referring to the 16S copy number. Right, so is this referring to the KEGG ortholog copy number? Yeah, it's both, actually. Okay, so you use it for both? Yeah, exactly. Cool. All right, and it doesn't really matter; the prediction method doesn't really care what these traits are. It could be KEGG orthologs, it could be anything, and so in one case we say, okay, it's 16S copy number, so that's just one of the traits we try to predict. Any other questions? Okay, so give the tutorial a whirl. There's a link to it, and it's on a slightly different website, Microbiome Helper, which I'll also talk about tomorrow; this is just a little small section until then. Okay, so again, you can raise your hands or use the sticky notes, I guess, for questions. Yep. Yeah, that's right. You can imagine the simplest case: if you had all the genomes, perfectly, you would just BLAST your 16S against a database of those genomes and then do this yourself, right? It's really all about taking that overhead away and figuring out the best way to do the prediction. Could you improve it by sequencing more genomes? Yeah, absolutely. So we're trying to make it a little bit more flexible, because the field changes really quickly, and some people are actually doing genome sequencing or culturing on a particular environment they're interested in, and then they're asking, you know, can I use those genomes right away? Because they're not in NCBI yet, or whatever, so how do I incorporate them? So we're trying to make that more flexible. And the other big one, which is more challenging, is that we're going to use genomes recovered from metagenomes and use those to help guide the 16S prediction. But that's in the future, for a PhD student of mine, so we'll see how that goes.
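One practical prep step mentioned earlier, filtering out de novo OTUs before running PICRUSt, can be sketched like this. The heuristic here is an assumption you should verify against your own table: Greengenes reference IDs are plain integers, while de novo OTUs from QIIME open-reference picking typically get non-numeric IDs (e.g. a "New." prefix). All IDs below are invented:

```python
# Toy sketch of filtering an OTU table down to reference-picked OTUs
# before running PICRUSt. Assumes Greengenes-style numeric IDs for
# reference OTUs -- check this convention against your own data.
otu_table = {
    "4471579": [10, 2],            # Greengenes-style reference OTU -> keep
    "228054": [0, 5],              # Greengenes-style reference OTU -> keep
    "New.ReferenceOTU12": [7, 1],  # de novo OTU -> must be dropped
}

def keep_reference_otus(table):
    """Keep only OTUs whose IDs are purely numeric (Greengenes-style)."""
    return {otu: counts for otu, counts in table.items() if otu.isdigit()}

filtered = keep_reference_otus(otu_table)
print(sorted(filtered))  # the de novo OTU is gone
```

In practice you'd do this on the BIOM file with the filtering steps the tutorial points to, but this is the logic of what's being thrown away and why.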