Okay, so we're going to talk about metagenomes some more, but instead of just looking at taxonomy and organisms, we're going to venture into a new area. You've been hearing about OTUs and species and strains and genera and all that fun stuff, but who likes taxonomy? That's right, nobody does. So we're going to talk about what I'm calling functional annotation, and I'll go into more detail about what that means.

The learning objectives for this half of the day are: to understand the difference between taxonomic composition and functional composition; to have a general understanding of some of the different functional protein databases out there; and to understand the pros and cons of assembly in metagenomics, and also of gene calling. I'm going to give my opinion on both of those, and we can probably have a conversation around them, because in my opinion these are heavily debated right now. And then you should be able to actually functionally annotate your metagenome using a tool called HUMAnN, which I'm going to cover today. You're also going to be using STAMP again in the lab, which you've already been introduced to, so you should be pros, and you'll see how it can be used with functional features as well as with organism features.

Okay. So taxonomic composition really answers the question we've already talked about: who is there, what are the organisms? Functional composition answers: what are they doing? It looks at the different protein-coding genes within the whole metagenome as a community, and of course the word metagenome comes from multiple genomes coming together in a community. It's the same sort of information you get from a genome annotation project, where you figure out whether the genome of interest can photosynthesize, or fix nitrogen, or metabolize antibiotics; but here we're looking at the community level and asking how much potential this community has to carry out these kinds of functions. That's where the really interesting biological questions are, and the question is how we get there.

And so what do we mean by function? In a really general sense, we could be talking about very broad categories like photosynthesis, nitrogen metabolism, or glycolysis, terms that get thrown around a lot but aren't very precise. Or we could be talking about specific groups of orthologs, particular gene families: particular nitrogen fixation genes, for example, or an EC number, which is just a different classification system for enzymes. One of my favorites is alcohol dehydrogenase, which acts on quite a lot in my body sometimes; every talk should have at least one alcohol dehydrogenase. Or it could be an identifier from a particular database; this one is a particular KEGG ortholog within the KEGG database, which we'll talk about in more detail. It's just an identifier for a particular gene family, in this case a particular kinase. So these are very specific functional features, or we can collapse them into more general features, and we'll see how those two levels play together.

Okay. So just as there are lots of taxonomy databases and tools, the list of protein databases is very large as well, so let's go through these in a little more detail.
I've listed the ones that I thought were the most relevant or popular, or somewhere in between, so we'll walk through them fairly quickly. Has everyone heard of COG before? Oh, that's awesome. The COG database is one of the older ones around; you'll still see it described in papers, and people will color their genes by COG categories. It was first put out by NCBI, and the name stands for Clusters of Orthologous Genes. It's really well known, and people still sometimes map genes to it, because it has nice general categories, things like "function unknown" and, I think, maybe pathogenesis factors, organized into a nice little hierarchy that works fairly well. But the actual COG database hasn't been updated since 2003, so it still works as a hierarchy, but nobody is adding new genes to it on a regular basis. So people sometimes map to it; it's okay, but I would say you probably shouldn't do that.

The next is the SEED database, which is used by RAST and MG-RAST. MG-RAST is the server we've talked about a couple of times, and RAST is the genomes-only version: if you're annotating genomes, there's a tool called RAST, which came out before MG-RAST. The SEED database is basically a system for annotating proteins using these SEED subsystems; they have their own system for mapping genes to subsystems and relating them together in a hierarchical structure. I'm not going to go into super detail on any of these, as you can tell.

Pfam is more focused on protein domains. It does have some full-length protein families, but it tends to be focused on smaller domains. The eggNOG database is related to COG originally, but it's very comprehensive and, I think, still being updated; it's an automated method for clustering proteins into groups and keeping those clusters up to date, and it's one of the larger ones out there, with around 190,000 different protein families.

UniRef is part of UniProt, which I think Rob and maybe someone else mentioned the other day. UniRef basically clusters the proteins down to different identity levels, which is quite useful and I think will be more useful in the future. UniRef100 means that any two sequences that are 100% identical get collapsed into one entry, so everything left is unique. UniRef90 means anything within 90% identity of each other gets collapsed, so it's a smaller, less redundant database at that point, and then UniRef50 collapses anything within 50% identity. The nice thing about this database is that they actually link these clusters to organisms, and link from there to the other clustering levels as well. It's automated, and it's put out by the group that runs UniProt, so EBI, I guess; anyway, they have lots of money, so it keeps going. That's that.

KEGG, I think we've heard of KEGG before. It's very well known and very popular, and each entry is really well annotated. I think what made KEGG popular was its manual curation, but also the ability to place these KEGG orthologs into a metabolic network and associate nice images with it. You've probably seen it in a paper before, even if you didn't know it was KEGG: a nice metabolic map with certain genes highlighted along the pathways. So that's KEGG.
We've been using KEGG in our lab, and in the lab associated with this lecture. It did go private quite a few years ago now, maybe four. It used to be completely open, then they closed it, and now you need a license to get the full download. You can still use the website for a lot of things and you can still search with their tools, but it's not being updated for public use as much. So there's some movement away from it, but at this point I would still say it's one of the most well-known and well-used options, which is why I cover it in the intro.

And then there's MetaCyc, which has been around for a while and is starting to become more popular; I would say it may start to replace KEGG in the future. KEGG contains microbes and humans, and I would say there isn't really a microbial focus in KEGG, whereas MetaCyc is a microbe-focused database and, I think, has a better handle on a lot of the different functions within microbes. That being said, I've tried to use MetaCyc several times, and each time I end up banging my head against the table a few times. I think it's slowly getting better; I think they're realizing they need to show people how things are done. Does anyone have impressions of any of these? Who likes MetaCyc? Anybody use MetaCyc? [Audience comment] Right, and there's a related database run out of the University of Minnesota that is biodegradation-centric, focused on xenobiotic biodegradation reactions in the environment. So in general there are lots of different functional or protein databases, and they all have different pros and cons. I'm going to focus on KEGG as we walk through this, just to give you a sense of how these things tend to be organized, but obviously they're all a bit different, and things will change in the future, as always.

Okay, so we'll do a few slides on KEGG just to get you up to speed on how things relate to each other; otherwise it's pretty tough to make any biological inference. We'll be using the KEGG database in this workshop. At one level there are what are called KEGG Orthologs, or KOs. These are the most basic clustering of genes, and the most specific level. Within a KO, the genes are thought to be orthologs. Anybody want to give me a definition of a homolog? Why not? Anybody know what a homolog is? [Audience: same gene] What's that? [Audience: genes that share a common ancestor] Anybody want to add to that? Going once? Twice? Sold. Yes: any two genes that share a common ancestor, that are related by ancestry phylogenetically, are homologs. And orthologs are homologous genes in different species that are related through speciation. So the genes within a KO are thought to be orthologs, and because of that they tend to be assumed to have the same function: if a gene is assigned to a KEGG ortholog, it's assumed to be doing the same function, catalyzing the same reaction. In the KEGG database there are roughly 12,000 KOs; that number might actually be higher, I'm not sure where I pulled it from, it might be more like 15,000. But we're still not in eggNOG territory of 190,000 families; we're an order of magnitude lower than that.
The nice thing is that KEGG organizes these KOs into what are called KEGG modules and KEGG pathways. If you pull up an entry from the database for a particular KO, you get the identifier for it, and a KO may map to one or many pathways; you can see the pathways listed. It's also linked to modules, and I'll explain what those two things are in more detail. Some entries are also associated with other kinds of hierarchies, and with human things like certain diseases.

Okay, so KEGG modules are manually defined functional units; that's what KEGG calls them. They're small groups of KOs that tend to work together to do one thing. There are around 750 different KEGG modules, and the idea is that each one has a single function. For example, this is the entry for module M00002, the core glycolysis module involving three-carbon compounds, and it has a list of KO identifiers. This is a bit small; can people see it? You don't have to read the numbers, but you can see the different KOs that go into it, and then they have this figure.

What's kind of neat about modules is that to get from here to here there are different options: different KOs can carry out the same function. So we have this KO here as the first step, and then either this KO or this KO can complete the next reaction. And they have a definition for what completes a module: it's not just having all the KOs, you don't necessarily need all of them. You need this one, and one of these, to get to this step, plus this and this, then one of these, and then one of these down here, to complete the module. So if you had some mixture of those KOs, the module would be considered covered and would count as complete. I'm telling you this so you understand where these numbers come from: the tool you're going to be using does this coverage calculation for you and determines whether a module is filled or not; it's not just counting KOs.

Yes, there's a question back there. [Audience: going back to KEGG, is this a different pathway?] No, this is one pathway that's showing you different options for completing that part of the pathway. [Audience: so one of these is required?] Yes, this means you need this one for sure, and then you need one of these two. The other notation they use, besides the graphical version which is nice to look at, is this parenthesized definition over here, which is what's actually read by the computer algorithm: you need this one, then one of these, plus one of these. Is that what I said? Maybe, I don't know. [Audience: is it either branch?] Sorry, yes: you can either take this big branch, which completes the pathway, or you can take this other branch instead. It's not more complicated than that. Okay, so those are KEGG modules.
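Just to make that coverage idea concrete, here is a minimal sketch of how a module-completeness check could work. This is not KEGG's actual definition syntax or the tool's real implementation; the module structure here is a simplified stand-in (a list of steps, each with one or more alternative KOs), and the KO IDs are only illustrative.

```python
# Toy check of KEGG-module "coverage": a module is complete if every step
# has at least one of its alternative KOs observed in the sample.
# The step structure and KO IDs below are illustrative, not real KEGG syntax.

module_steps = [
    {"K00844", "K00845"},   # step 1: either KO works
    {"K01810"},             # step 2: only one option
    {"K00850", "K16370"},   # step 3: either KO works
]

observed_kos = {"K00845", "K01810", "K16370", "K99999"}

def module_complete(steps, kos):
    """True if every step is covered by at least one observed KO."""
    return all(step & kos for step in steps)

def module_coverage(steps, kos):
    """Fraction of steps that are covered (0.0 to 1.0)."""
    covered = sum(1 for step in steps if step & kos)
    return covered / len(steps)

print(module_complete(module_steps, observed_kos))   # True
print(module_coverage(module_steps, observed_kos))   # 1.0
```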
KEGG pathways are even larger groupings; they're big, nicely named things that we like to put a label on, but they're not really cohesive functional units. KEGG pathways group KOs into larger pathways, and there are approximately 230 of them in KEGG. The nice thing about pathways is that each one has a graphical map within the database, so you get these really nice, clear images, and you can highlight particular genes within a pathway; but they tend to be very large. Pathways can also be collapsed into even more general terms, but those terms are usually not super useful, things like "amino acid metabolism" or "carbohydrate metabolism". That said, if you look at metagenomics papers, they do say things like "carbohydrate metabolism increased in the gut compared to the skin", as we saw in the HMP paper in Nature. Awesome.

Okay, so those are the protein databases. Any more questions about databases before I move on to the systems that use them? Is there a database I haven't mentioned that's your favorite? Everyone's happy with that? Everyone's super sleepy after Subway, I can tell.

Okay, so, on to the annotation systems. This is like how we did classification for taxonomy, but now using functions. Again there are web-based methods, and I've listed three here. We've already heard of MG-RAST. I don't know if IMG/M has been mentioned; it used to be a semi-open resource where you could submit your data, but now it requires a login, and I'm not really sure whether anybody can get a login or whether you have to be part of a project. People here might know more about JGI and IMG. [Audience] Anybody can get a login? And anybody can upload their data and get results back? It's easy to use? And what about using their actual pipeline, is that only for JGI-sequenced metagenomes? [Audience] That's the impression I got, that the pipeline is for their own data, under the JGI umbrella. That's good to know. EBI has a metagenomics server as well; this one is public as far as I know, it's relatively new from one of the bigger centres, and I don't know a lot about it either, but if you're in the "I'm going to throw my data at a pipeline and see what comes out the other end" camp, I think it's worth a look.

Then there are local, graphical tools that you install on your own computer. MEGAN we've talked about already; MEGAN does functional annotation as well as taxonomy.
CloVR I also put in here: you boot it up and it runs its own little Linux server on your laptop, and it contains nice standard operating procedures for how to go through an analysis. It hasn't been super well maintained as far as I can tell, and the last person I talked to said it was a pain in the butt to use, but that goes along with most bioinformatics tools. So again, I'm listing it as something you may want to look at if you want a graphical interface.

Then there's the more nitty-gritty, local command-line stuff, which is obviously what we're tailored towards here at CBW; that's where all the cool kids hang out. On the local side I listed metAMOS, which is really hands-on if you want to play with different settings. It's a pipeline tool that starts with filtering the raw reads, does assembly (which we'll talk about in a few minutes, if you like doing assembly), keeps track of where those reads map in the assembly, and can annotate using different methods, with different options and different databases: lots of variables, and lots of different outputs. The problem with a tool like this (you might be thinking, why don't people just make tools that do everything at once) is that it's hard to maintain and usually fairly buggy. I've attempted to use it a few times and could never really get from one end to the other, although that was a little while ago and it might have improved. If I were doing metagenomic assembly, I would start with metAMOS; I think it's a good starting point, and it includes different assemblers suited to metagenomic assembly.

Also very popular in the functional space is doing it yourself: people just build their own pipelines, and I think that's actually very common. For taxonomic assignment people tend to use established methods, but for functional assignment they take whatever database they're interested in, UniProt say, run their own BLAST, take the top hit, bam, done. Is that worse than some of these other methods? If you look at HUMAnN, which we'll talk about next, you'll see the extra steps it takes, and then you can ask yourself whether that's really a lot better than a single BLAST and taking the top hit. Usually I would say yes, because anything you write yourself introduces bugs and weird results, and I've also found that if you just do your own thing it's harder for reviewers to tell whether you did it correctly. So using a standard pipeline is sometimes the way to go.

Okay, so, I didn't include the citation for HUMAnN, sorry about that. HUMAnN was developed as part of the Human Microbiome Project, so obviously it was used on human samples, but it's been applied to lots of other sample types, so you shouldn't feel like you can't run it on your soil samples or your ocean samples; you shouldn't feel shunned if your samples aren't human-associated. It's based on the KEGG database on the back end right now. It is a fairly large pipeline, and I'm going to walk us through each step; it's going to get a little more technical than this morning, but I think it's worth showing you some of the things it's doing in the background.

Okay, so the first step is that we have short DNA sequences and we do some QC steps, and they also remove human DNA beforehand. Have we talked about human contamination, beyond the fact that it happens? Did you cover it, Will?
Everyone else says no, so unfortunately I don't cover it in the lab, but I'll mention it now. Obviously with metagenomic data we sequence all the DNA, so you can get contamination, say from humans or from other species you're not interested in. If we have human samples, we probably don't want to characterize the functions in the human DNA. So typically what people do is use Bowtie to map the reads against the human genome first; if you had mouse samples you'd use the mouse genome, if you were doing squid you'd use the squid genome, or whatever you like. That just filters those reads out so you're not functionally annotating something that isn't really part of the microbiome; it's essentially contamination. That step is pretty straightforward: anything that maps to the host genome is screened out.

All right, so the next step is a translated search against the functional database, in this case the KEGG orthology. Before we get there: people have used BLAST before, right? Does everyone know what BLASTP is? Protein query against a protein database. And BLASTX? Nucleotide against protein, perfect. So the idea here is that we translate the nucleotide sequence into six frames, three forward and three reverse, take each of those translations, and search them against the protein database. That's what's happening at this step. We're searching against a database which in this case is in the hundreds of thousands of sequences, and if you have a MiSeq run of 25 million reads, that's a lot of comparisons; BLAST is not going to finish anytime soon on a normal server.

This is where some newer search tools came in. This figure is pulled from the paper for a tool called DIAMOND. Before that there was BLASTX, then a tool called RAPSearch, and then RAPSearch2, which was fairly fast; these are all meant for translated searches against protein databases. On the y-axis you have how many times faster these tools are compared to BLAST. RAPSearch2, in blue, was about 100 times faster. Then DIAMOND came out and sort of blew it out of the water: supposedly 20,000 times faster on some of these datasets. It has a fast setting and a sensitive setting; the fast setting gives the really dramatic speedup, and with the sensitive setting you're back down to around 2,500 times faster. What this allows is really fast searches. They show some accuracy numbers, but it's also been tested by colleagues of mine, and basically below about 80% identity you really lose sensitivity: BLAST can reliably find proteins around 80% identity, whereas DIAMOND falls apart there and won't return the same results. Above 80 to 90% identity, DIAMOND will give you the same top hit as BLAST. So you're sacrificing accuracy for the more distant proteins, and in exchange you get massive speedups: a search that takes about a minute with DIAMOND would take on the order of two weeks with BLAST. Maybe it's not the full 20,000 times they claim in practice, but it's still the difference between minutes and days.
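Just to make the translated-search idea concrete, here is a minimal sketch of the six-frame translation that BLASTX-style tools perform on each read before comparing against a protein database. It assumes Biopython is installed, and it's only an illustration of the concept, not what DIAMOND does internally.

```python
# Six-frame translation of a nucleotide read, as done conceptually by
# BLASTX / DIAMOND blastx before a protein-vs-protein comparison.
# Assumes Biopython is available (pip install biopython).
from Bio.Seq import Seq

def six_frame_translations(read):
    """Return the six translated frames (3 forward, 3 reverse) of a read."""
    seq = Seq(read)
    rev = seq.reverse_complement()
    frames = []
    for template in (seq, rev):
        for offset in range(3):
            # trim to a multiple of 3 so translate() doesn't complain about partial codons
            sub = template[offset:]
            sub = sub[: len(sub) - (len(sub) % 3)]
            frames.append(str(sub.translate()))
    return frames

read = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
for i, prot in enumerate(six_frame_translations(read), start=1):
    print(f"frame {i}: {prot}")
```

Note that a 100 bp read only gives you roughly 33 amino acids per frame, which is the point made above about short reads and translated searches.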
[Audience] The strategy we use is to run the faster tool first and then send anything it misses on to BLASTX, so anything that fails falls back to BLAST. One issue I had with DIAMOND: if the number of reads is very small, DIAMOND can actually be slower than BLASTX, because the database indexing takes a fixed amount of time. So for large samples DIAMOND works fine, but for a small sample the indexing dominates. Yes, I agree: if you only ran 10 or 100 sequences through it, it would take longer, but if you run a large number of reads through it, DIAMOND is awesome. As with anything there are small drawbacks, but even if it's "only" 10,000 times faster, it's still minutes where BLAST would take days.

Okay, we're going to be using DIAMOND in the lab later today. The nice thing about it is that it's a simple replacement: the same sort of command you would use with BLASTX, with slightly different options, but the output looks identical. So it slots nicely into the many pipelines that expect BLASTX or RAPSearch2 tabular output, and you can keep using the same downstream steps.

Okay, so the first step is that translated search, BLAST or, in our case, DIAMOND, and you get this raw tab-delimited table of hits, which then goes into the next steps. The next step is to weight those hits. You have your BLAST results coming out, and the idea is to give hits with better scores a larger weight. So you compute the relative abundance of each KO, but you don't just count each hit as one: you weight it by the inverse of the p-value. They said it didn't matter much which statistic they used; they tried the e-value, they tried the bit score, and they all gave similar results, but they went with the inverse of the p-value. It just means that the more significant the hit, the larger its weight, so that hits that are not strictly confident contribute less. The other big step here is that KO abundances are normalized by the average gene length of that KO family. You can imagine that the chance of a read hitting a large protein is higher than hitting a really small protein, so to account for that they take the average length of each group of proteins and divide by it, so that the small proteins don't end up under-represented in your functional profile. Makes sense so far, fairly logical, right?
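Here is a minimal sketch of that weighting step, under the assumptions just described: each hit contributes 1/p-value rather than a flat count of one, and per-KO totals are divided by the KO family's average gene length. The input layout and the numbers are hypothetical stand-ins, and this isn't HUMAnN's exact code.

```python
# Toy KO abundance table from translated-search hits.
# Assumed input: tuples of (read_id, KO_id, p_value); a stand-in for the
# real tabular search output, not the actual format.
from collections import defaultdict

# Hypothetical average gene lengths (in amino acids) per KO family.
avg_ko_length = {"K00001": 350.0, "K00845": 300.0, "K01810": 550.0}

hits = [
    ("read1", "K00001", 1e-30),
    ("read2", "K00001", 1e-05),
    ("read3", "K01810", 1e-20),
    ("read4", "K00845", 1e-12),
]

raw = defaultdict(float)
for read_id, ko, pval in hits:
    weight = 1.0 / max(pval, 1e-300)   # more significant hit, larger weight
    raw[ko] += weight

# Normalize by average KO gene length so long genes aren't over-counted,
# then convert to relative abundance across the sample.
length_norm = {ko: w / avg_ko_length[ko] for ko, w in raw.items()}
total = sum(length_norm.values())
rel_abundance = {ko: v / total for ko, v in length_norm.items()}

for ko, ab in sorted(rel_abundance.items()):
    print(f"{ko}\t{ab:.4g}")
```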
The next step uses a previously published method called MinPath. The idea here is to reduce the number of spurious, false-positive pathways. As I already mentioned, each KO can map to one or many KEGG pathways, because the same gene can do different things in different systems, and a hit to a KO doesn't necessarily mean that every pathway containing it is actually present in the community. If a pathway has, say, 20 different KOs, different steps along the pathway, but only two of those KOs are observed in your community, you don't want to count that pathway as actually being present in the sample. MinPath, embedded within HUMAnN, tries to find the minimal set of pathways that explains the KOs you observed: it says these KOs over here cover most of this pathway, so we'll assign them there, and pathways that don't reach some meaningful level of support get dropped, so there are fewer false positives.

Okay, the next step is a taxonomic limitation. I really like that they used a picture of a dog for this; it should have been a horse for my stuff. Again, this is to reduce false positives and to help normalize the KO abundances. This step is a little sketchy, to be honest. What they do is look at the KEGG genes that your metagenome reads lined up with, and ask whether the organism that gene comes from makes sense for your sample: if you found "dog" in an ocean sample, or some other odd assignment, it asks whether that gene should really be attributed to that organism. That makes intuitive sense for dogs and cats; it's less clear-cut for particular bacterial groups. If a KO has been assigned to a bacterial group you otherwise don't see in your sample, should you really drop it? I'm not sure. But this plays a fairly minor role in the normalization. The main goal is to remove pathways that aren't found in any of the observed organisms and that are made up mostly of KOs mapping better to a different pathway, so pathways they don't believe really exist in the community are removed.

Then they do some normalization for gene copy number, because a KO sometimes appears more than once within a genome. They try to estimate the copy number within the genome and divide by it. Again, the step is a little sketchy, but it means that if a genome carries three copies of the same gene, the hits to that gene get divided by three to normalize for it.

And then the last step is sort of the opposite problem, where not enough KOs have been observed. They account for the fact that metagenome sequencing usually doesn't saturate; it's entirely plausible that KOs are missing just because you haven't sequenced deeply enough to cover every genome. So if some KOs have really small abundances even though the rest of their pathway is abundant, they actually increase the relative abundance of those KOs within the pathway to complete it. That's the last step of this crazy pipeline. So the question is, could you just take the top hit instead, and would it matter? I don't really know, but we'll use this pipeline, and it does several things that make sense.

Okay, so out of this pipeline you get two major types of information, at both the module and the pathway level. You get what's called coverage, which is basically presence or absence in the community: if a module is present, it's been found to be complete, and that's just a one or a zero. To me that information doesn't make a lot of sense, because it's over the whole community. The more interesting output is the pathway (and module) abundance, an actual number giving the relative abundance of that pathway or module, which you can then compare across different samples.
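To make the two output types concrete, here is a small sketch of how "coverage" (is the pathway essentially complete?) and "abundance" (how much of it is there?) could be computed from a KO abundance table. The 90% threshold, the simple averaging, and the pathway-to-KO mapping are illustrative assumptions; HUMAnN's actual formulas (MinPath, the taxonomic limitation, the gap filling) are more involved.

```python
# Toy pathway coverage vs. abundance from per-KO relative abundances.
# Pathway membership, KO IDs, and the 90% coverage threshold are illustrative.

pathway_kos = {
    "glycolysis_demo": ["K00844", "K01810", "K00850", "K01623", "K00134"],
}

ko_abundance = {          # relative abundances from the earlier weighting step
    "K00844": 2.1e-4,
    "K01810": 1.8e-4,
    "K00850": 0.0,        # not observed
    "K01623": 2.5e-4,
    "K00134": 3.0e-4,
}

for pathway, kos in pathway_kos.items():
    present = [ko for ko in kos if ko_abundance.get(ko, 0.0) > 0]
    coverage_fraction = len(present) / len(kos)
    covered = coverage_fraction >= 0.9                     # "complete": 1 or 0
    abundance = sum(ko_abundance.get(ko, 0.0) for ko in kos) / len(kos)
    print(pathway, f"coverage={int(covered)}", f"abundance={abundance:.3g}")
```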
Any questions about HUMAnN? [Audience question] Yeah, I think there are several steps in here that make people uneasy. The only thing I can say in its defense, and yes, I agree it feels weird to arbitrarily raise the abundance of these low KOs toward the median level of the rest of the pathway just because that's what they think should be there, is that they did test most of these choices to see whether they improved accuracy on actual datasets. They used simulated data, where they could ask what happens when sampling misses some KOs, and they also did a mock community with real genomes mixed in known proportions and sequenced, so they knew which pathways were really there, and all of these steps were tuned against that to optimize accuracy. So there should, I would hope, be some truth to it; but I agree that you have to keep it in mind.

The other thing to keep in mind is that the numbers coming out of this are relative abundances. You'll see that they're really small: they sum to one across a whole sample, so an individual pathway might be something like 10^-6, which is fine, it's a relative abundance, but it doesn't equate in any way to the number of reads within that pathway. You've seen how it's been divided by gene length, scaled by MinPath, and generally messed around with; the number doesn't correspond directly to the proportion of reads mapping to that pathway. It's their estimate of the functional capability of that sample based on the reads. It's a distinction you have to keep track of, I agree.

[Audience comment] Yeah, I think the issue is that if you want to talk about pathways or modules, you have to figure out some way to count them. If you have a pathway with ten genes in it, and five reads map to this KO and five reads map to that KO within the pathway, what number do you put on the pathway? Is it ten? Is it five and five? And what if you don't see any of the other KOs in that pathway? So I understand what they're trying to do, but as a biologist it is a bit scary when the pipeline is altering the numbers. Yes, I agree, it's a reconstruction. I mostly bring up these steps so that you think about them. The big ones that most people do are really the gene-length normalization, which makes a lot of sense, because you don't want longer genes over-counted just because reads are more likely to hit them; but some of the other steps are a little wishy-washy. Not a huge confidence builder for a tool you're about to use, but anyway, that's what we're going to do.

Okay, on to other controversial things. So, what about assembly?
For people who don't know what assembly is: assembly, as used in genomics, joins raw reads into contigs, which are contiguous pieces of DNA, and those contigs get put into scaffolds, which are sets of contigs, sometimes with gaps between them. We have fragmented DNA, we find overlaps between the reads (we can do this because we have good read depth, 10 or 20 times over the same region, so we know these pieces fit together), and then we assemble the overlaps into contigs using nice fancy algorithms and de Bruijn graphs, and assemble the contigs into scaffolds. That's how it works for a single genome. The problem is what happens when the reads aren't all from the same genome and you throw a genome assembler at it: you have a massive mixture, and you can assemble things together that really weren't from the same genome. The assembler has no idea the reads don't come from the same organism; it's just trying to join things together.

Okay, so let's go through some pros and cons, and people can weigh in on whether they agree; I'll take opinions from the back as well. Pros for assembly. One, and this was actually a major reason why people used to do assembly, is that it helps with computation time. If you collapse the reads down into one representative sequence, the similarity search (back when we used BLAST, anyway) is much faster, because instead of searching the 40 or 50 or 1,000 reads that came from the same gene, you search that sequence once. So people used to do assembly just to speed up that step; assembly takes time and memory too, but before DIAMOND existed you really couldn't BLAST 25 million reads against a database of hundreds of thousands of proteins, so it was used for that reason.

The other reason was that reads used to be shorter, 50 or 75 base pairs, and when you're doing a translated search, that gets divided by three, so you're trying to find sequence similarity based on a peptide only 20 to 25 amino acids long. That might not be long enough to confidently say this read came from this gene. As reads get longer, that problem largely goes away: at 300 base pairs you're searching with roughly a 100 amino acid translation, which is a lot better.

And the biggest pro right now, I would say, is that people are using assembly to reconstruct genomes, which I mentioned briefly this morning. The idea is that if you can do assembly, you can say that these contigs, these large pieces of DNA, really do belong to the same genome, and that's really interesting, because then you can start to say, for this organism we've never cultured and that looks really cool, ooh, it's doing this, this, and this. The caveat is that those reconstructions might not be entirely real: we don't know for sure that those reads belong together, and the assembly could be collapsing different strains into a single composite genome, so you have to take it with a grain of salt. That being said, there was a paper very recently reporting new genomes, potentially from a new phylum, and they used genome reconstruction from metagenome assembly to describe what's unusual about those genomes. Interesting paper, and it's hard to say at this point whether there could be false positives or not; maybe with new sequencing technologies down the road, where we have much longer reads, we can actually test these assemblies and see whether they really are collapsing things from different strains or isolates together.
Okay, so those are the pros. The cons, and the list is a bit longer here. Obviously, as I just said, the reads are often not from the same genome, so you can form chimeras: genes and reads that don't belong together get put together artificially. You also have to keep in mind that read depth in a metagenome is often not as deep as in genomics, where a genome is sequenced 20 or 30 times over; we often don't have that depth, so the assembly tools can be limited in their power. High organism diversity can also cause assembly to fail outright; this has been shown in that recent paper as well. If you have hundreds or thousands of different things in your community, an assembly program just can't handle it very well; if you subsample down to the point where you only have 10 or 20 different things, or you had a low-complexity, low-diversity sample to begin with, metagenome assembly can work much better. With only three or four organisms there's a good chance you can reconstruct those genomes well; with thousands or tens of thousands it gets much harder.

This is a more minor point, but sometimes, when people did assembly to save computation time, they would then count how many genes were found in the assembly and report that as relative abundance. The problem is that the assembly has collapsed all those reads, so you've lost the count information. metAMOS, that pipeline I mentioned, actually does a good job here: it keeps track of the reads, maps them back onto the contigs, and adds the read counts back in. It's just something you have to be aware of; it's an extra step if you do use assembly.

And the last big point: if you do assembly, for whatever reason, it can really bias your functional results. The whole point is to say what your community is doing functionally, for example "this community has two times more capacity to break down this sugar", and if you assemble first, you can introduce bias, because you may only end up annotating the organisms that assembled well, and not the reads that didn't make it into the assembly. So what do you do with all the reads that fail the assembly? Do you annotate those separately, or leave them off to the side and only functionally annotate the assembled part? If you're trying to look at the global functional composition, assembling first makes that hard. Is assembly useful? Absolutely, for genome reconstruction and for trying to link things together; you just have to keep those caveats in mind. I would say that if you can avoid assembly for functional profiling, it might be better. Any thoughts on this?

[Audience] We've often backed it up by doing some qPCR as well, and even some primer walking, inverse PCR from the pieces you're joining, to prove they really go together. Yeah, absolutely, I think that's probably what you want if you're going to claim a new genome, and I believe in that paper they did do some qPCR to check that pieces really came from the same genome.
[Audience] I don't quite follow, sorry. There are tools that basically assign reads back to genomes; could you assign reads back to the genome you've assembled and use that as a test of how well the assembly worked? Yeah, you can, but I don't think it's going to tell you for sure. You can look at read depth: if you have a contig with three times as much read depth over this section and only a third as much over here, that suggests maybe those pieces don't really belong together. I think that's a good idea, and I think some of the people doing genome assembly from metagenomes have attempted that; I can't say much more than that.

[Audience] Do you have an idea of how common it is, how chimeric these reconstructed genomes are? I don't think there's been a great estimate. The people doing genome reconstruction are mostly aiming for longer and longer contigs, and I think they sometimes do in silico checks to make sure they're not getting chimeras, but what I think isn't appreciated is that at some level you're always getting chimeras: if there's any strain variation, the assembler is going to collapse those strains together, especially if they're at similar abundances. So I don't have a good sense of how problematic it is, but the potential size of it worries me.

[Rob, from the back] We had a PhD student who was dealing with this with very similar strains; most of the genome was at a read depth of about 50 times, and the regions that were very different were at about 25 times, and because one of the strains had been sequenced from a pure culture, she was able to do a sort of subtractive comparison. So it depends on the diversity, and it also depends a little on which assembler you use. Yes, so it's very dependent on the number of things in your sample and on what your goal is. I think it's worth it if you understand that you're really trying to reconstruct something, and you check for these chimera errors and double-check everything. Is going through that process worth it? Absolutely, when you can't culture something and you think you have something really cool that you want to describe; it's worth going after. As a standard step in a pipeline like this, though, I'm not as big a fan; I think skipping assembly avoids a lot of biases.
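As a small illustration of that read-depth check, here is a sketch that takes per-base depth (for example, the three-column output of `samtools depth` after mapping reads back to your contigs) and flags contigs whose coverage is very uneven, which can hint at a chimeric join. The window size and the 3x ratio threshold are arbitrary choices for illustration, not a published rule.

```python
# Flag contigs with very uneven read depth (a rough chimera warning sign).
# Expects lines like "contig_name<TAB>position<TAB>depth", as produced by
# e.g. `samtools depth contigs.bam`. Window size and ratio cutoff are arbitrary.
import sys
from collections import defaultdict

WINDOW = 1000      # bp per window
MAX_RATIO = 3.0    # flag if max window depth / min window depth exceeds this

window_sums = defaultdict(float)   # (contig, window_index) -> summed depth
window_lens = defaultdict(int)

with open(sys.argv[1]) as fh:
    for line in fh:
        contig, pos, depth = line.split()
        idx = (int(pos) - 1) // WINDOW
        window_sums[(contig, idx)] += float(depth)
        window_lens[(contig, idx)] += 1

per_contig = defaultdict(list)
for (contig, idx), total in window_sums.items():
    per_contig[contig].append(total / window_lens[(contig, idx)])

for contig, depths in per_contig.items():
    lo, hi = min(depths), max(depths)
    ratio = hi / lo if lo > 0 else float("inf")
    status = "UNEVEN?" if ratio > MAX_RATIO else "ok"
    print(f"{contig}\tmean_depth={sum(depths)/len(depths):.1f}\tmax/min={ratio:.1f}\t{status}")
```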
Okay, the next little thing: what about gene calling? This also comes from genomics. Usually when we annotate a genome, we assemble first and then do gene calling, where we try to predict the start and stop sites of each gene. When metagenomics first came out, assembly was big and gene calling was big, and I would say gene calling in metagenomics is almost dead; I don't see many people still doing it, but we'll walk through the pros and cons anyway.

The pros of doing some type of gene calling on your metagenome are, first, that it may result in fewer false positives from annotating things that aren't real genes. With BLASTX you're just translating in all six frames, some of which aren't real genes at all, and then you're assigning function to a piece of DNA that isn't a gene in the first place. The other pro is that it lowers the number of comparisons you have to do later on: instead of searching all your reads, you're only annotating the things that were actually called as genes, and again this goes back to BLASTX being slow and wanting to limit the number of comparisons.

The cons: there's no good training basis. A lot of gene-prediction programs base their predictions on similar genomes, where they've learned the sequence composition and the gene features of related organisms; you don't really have that training set for a community, because it's a mixture of many genomes, so it doesn't work very well. Also, do the reads cover entire genes? In a genome assembly you actually have full genes, start to finish, but with reads of 100 or 200 base pairs you can land anywhere: you might not have the start, or the end, or either, so does it really make sense to call a gene on a partial read? And it usually requires assembled data: most people who do gene calling on metagenomes have assembled first, which takes you down the genome pipeline. There are tools meant specifically for metagenome gene calling, FragGeneScan for one, and some of the traditional gene callers have a metagenome option, so they are out there; it's just that most people do a BLASTX-style six-frame translation and search directly, without gene calling. Okay, any questions about that?

Right, okay: community functional potential. This is my version of this morning's point, where I said we shouldn't call things absolute abundance, it's relative abundance. This is my little pet peeve; actually it's not just mine, I get told this a lot by other people, so I'm passing it on to you so you know what people are talking about. We're doing metagenomics. We're not doing metatranscriptomics, which is what John is going to talk about tomorrow, and we're not doing metaproteomics. We're getting the sequence of the actual genes, so to say that this community is functionally doing X compared to some other community is not really true; it's the potential. We don't know whether this community is actually transcribing those genes, and we don't know whether those transcripts are being translated into proteins. It's the same argument as genomes versus transcriptomes versus proteomes. So within a microbial community, it could be that yes, this community has double the genes for some function, but we don't know for sure that it's actually transcribing and using them. You just have to be careful to remember that it's really potential, and it's not a sure thing unless you do follow-up metatranscriptomics, or have metabolomics to back up that they really are doing it. That being said, microbes tend to be fairly thrifty and their genomes change fairly quickly, so I feel fairly comfortable saying that when you see these changes in gene relative abundance, it probably suggests they really are doing those things functionally. But you can't claim it; it's suggestive, not certain. Does that point come across clearly?

Okay, so that's the last of my, I don't know what to call them, biological slides. There are a couple of things I want to cover at the end: a bit of a wrap-up, a bit of an advertisement, and a bit of an information pitch.
Okay, first is Microbiome Helper. We have a repository where I throw scripts that I find useful for my own work. It's a site with a few things: it has helpful scripts that wrap and combine different tools, and you've used a couple of them in the lab already and will use a couple more in the lab coming up. It's on a GitHub site, and the nice thing is that we update it whenever tools change or we need to change things, because it's what we use in the lab. The other nice part is that it includes standard operating procedures: for 16S and for metagenomics, we've written out, step by step, the steps we run on data from start to finish, and that changes over time as tools change. So if you're ever wondering what Morgan's lab is using these days, or you're looking for a guide, you can take those scripts, download the tools they wrap, and process your data in a fairly automated fashion. That's my small, semi-advertisement for that.

This next one is meant as an overview slide, and the take-home message, if you haven't already realized it, is that there's a common format, a common conceptual idea, running through all of this. Yesterday you heard a lot about 16S rRNA data and the fact that we get these OTU tables, and we were making those with QIIME: we end up with a table, and it can be in different file formats, but the whole idea is that we have counts of OTUs across different samples, so this OTU occurs twice in this sample, and so on. The same thing happened this morning when we used MetaPhlAn: we didn't get OTUs, we got named things, the genus or species or phylum level, but the same kind of counts in the same sort of table format. Today, this afternoon, we're going to do the same thing again, except now instead of OTUs we're getting KEGG orthologs, which we can then collapse into modules and pathways; but it's the same sort of data. And what's nice is that we'll use STAMP on it the same way and make the same sorts of plots with the same sorts of ideas. It's conceptually nice to understand that once your data is in this format, the downstream analysis is shared.

And PICRUSt, which I'll talk about in a minute, actually takes this kind of information, an OTU table, and predicts what you would get if you had sequenced the metagenome for the same sample, without actually doing it. So the idea is that you can get a predicted metagenome and then do the same kinds of comparisons of what might be significantly different between samples.
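Just to underline that "same table, different features" point, here is a tiny sketch of the shared format: a feature-by-sample count table that works the same whether the rows are OTUs, MetaPhlAn taxa, or KEGG orthologs, written out as a tab-separated file roughly like the kind STAMP and similar tools read. The feature names and counts are made up for illustration.

```python
# One table format for OTUs, taxa, or KOs: rows are features, columns are samples.
# Feature IDs and counts below are made up; the TSV layout is a generic
# feature-by-sample table of the kind STAMP-style tools expect.
import csv

samples = ["gut_1", "gut_2", "skin_1"]

tables = {
    "otus": {"OTU_17": [2, 0, 5], "OTU_42": [10, 8, 1]},
    "taxa": {"g__Bacteroides": [120, 95, 3], "g__Staphylococcus": [1, 0, 88]},
    "kos":  {"K00844": [34, 40, 12], "K01810": [29, 31, 9]},
}

for name, table in tables.items():
    with open(f"{name}_table.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["feature"] + samples)       # same header either way
        for feature, counts in table.items():
            writer.writerow([feature] + counts)
    print(f"wrote {name}_table.tsv")
```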
All right, the information session is almost over, but I do want to mention this, mostly because people will be going back to wherever they came from. Here at Dalhousie we recently started what's called the Integrated Microbiome Resource. It's based out of my lab but has many collaborators, and we're offering it as a resource: we offer some sequencing capacity, but also the bioinformatics that goes along with it. The idea, of course, is that you go back, do your own microbiome analysis, and help out all your friends; but if not, and you get sick of it and just want to send samples to us, then for a tidy fee we can do the sequencing and the bioinformatics for you and hand back the results, either as STAMP files or OTU tables or nice visualizations.

We obviously haven't talked about the wet lab much, because this is a bioinformatics workshop and I'm not a huge wet-lab expert at all, but through the IMR we've developed a 16S amplicon workflow. If you have questions about it, you can talk to Andre, who is actually at the workshop and who developed it. The nice thing is that it's set up for the Illumina MiSeq: a single-PCR setup where the barcodes and the sequencing adapters are added together, then verification on an E-Gel, then normalization so there's a comparable amount of DNA across samples. We've been running it on the Illumina MiSeq in the Tupper building using 300 plus 300 base-pair reads, so we get overlapping reads across the 16S amplicon. We're also hopefully getting a NextSeq next spring, next door to the Tupper, and we're hoping to do metagenomic sequencing for a decent cost, because right now it's kind of expensive on the MiSeq, maybe $300 to $400 a sample, maybe $600, and I'm hoping the NextSeq will bring that down toward the $100 range, but we'll see.

And then my last slide is also an overview. Hopefully by the end of this you see that all of these major tools work together in one large workflow: we have our sequencing, we do quality control and read-pair stitching, we use QIIME for 16S data, or MetaPhlAn and HUMAnN, which is what we're about to do, for metagenomes; there's other stuff that wasn't covered in this workshop and more coming in the future, but this is just meant as an overview of the past two days, so hopefully it's starting to make sense how things fit together in the larger scheme of things.

Okay, so that's it. Do we have any questions on any of that so far? So we have two options: one, we do a coffee break and then the lab, or two, I go over the PICRUSt lecture right now; it's about five or six slides. Do PICRUSt now? All right, she decided, it's her fault. Yay, PICRUSt. And I can take detailed questions about it afterwards.

So this is what I showed a few minutes ago. PICRUSt stands for Phylogenetic Investigation of Communities by Reconstruction of Unobserved States; I have to put that up there because I can never remember it. It's hosted on a GitHub website, it was published in 2013 in Nature Biotechnology, and it was a large collaboration involving Curtis Huttenhower, Rob Beiko, and Rob Knight; I was a postdoc at the time, and there are people on the paper I still haven't met, even though we've had so many phone calls it's kind of silly.

So the idea behind PICRUSt, how does it work? When we cluster sequences into OTUs and assign taxonomy to them, why do we do that? Why do we put names on OTUs instead of just treating them as anonymous OTUs and making PCA plots? [Audience answer] Awesome, exactly: we associate names with particular functions or ecological context. That's why we put names on things: because we think we know something about them when we say, for instance, E. coli.
Well, that's not a great example, but when we say anything about an OTU, we think we know something about that organism: where it lives, where it shouldn't be, things like that. So can we do that in a quantitative way? Knowing that we have these organisms in the sample, what can we say about what functions we think they're carrying out?

The way we do that is with a 16S reference tree; in this case we use the Greengenes tree. It's fairly large, and it has some tips whose sequences come from reference genomes, where we've actually sequenced the whole genome, but many, many more tips where we just have 16S sequences from the environment. The idea with a reference OTU is that we take a 16S sequence, match it to the best tip of the tree, and then we know how that tip relates to the rest of the tree.

If we zoom in on one spot, the way PICRUSt works is that we take information from neighbouring genomes. In this figure, for these red tips we have a genome sequence (how many genomes are in NCBI now, 12,000 or 13,000, more if you include drafts), and for a particular gene, say a particular KEGG ortholog, we record how many times that KO appears in each genome: this genome has it twice, this one once, this one twice, this one four times, this one might have none. These other tips are ones where we have no information; they're just environmental sequences. The idea is, if we have this tip here and we want to ask what we think the copy number of that gene is at this tip, we go up to the ancestors and use phylogenetic ancestral state reconstruction, which attempts to figure out, based on this information and the rest of the tree, what the copy number of this gene was at each ancestor. We can use parsimony, but there are maximum likelihood methods as well. These methods spit out continuous values; obviously you can't have 0.2 of a gene, but we keep it as a decimal, and we also get an estimate of the uncertainty. Then we make an inference from the ancestor down to this tip and predict what we think the copy number of this gene is there, with error bars that grow the further we are from a sequenced genome.

Okay, so that's how it works for a single gene at a single tip. Then we repeat that for, say, 8,000 KOs, for every gene family, and then, rather than doing this on the fly, we pre-calculate it for all the other tips in the tree, for all the tips at the 97% OTU level. That's on the order of 100,000 tips, and it all goes into one pre-computed file.
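Here is a toy illustration of that "use your sequenced neighbours" idea: predicting a tip's gene copy number as a branch-length-weighted average of the nearest tips that do have genomes. This is only a stand-in to build intuition; PICRUSt itself uses proper ancestral state reconstruction over the whole tree, not this nearest-neighbour average, and all the numbers here are made up.

```python
# Toy stand-in for PICRUSt's copy-number inference: predict an unknown tip's
# gene copy number from sequenced neighbours, weighting each neighbour by
# 1 / (phylogenetic distance). NOT the real algorithm (which uses ancestral
# state reconstruction); distances and copy numbers are invented.

neighbours = [
    # (phylogenetic distance to the query tip, observed copy number of the gene)
    (0.02, 2),
    (0.05, 1),
    (0.10, 2),
    (0.30, 4),
]

def predict_copy_number(neigh):
    weights = [1.0 / d for d, _ in neigh]          # closer neighbours count more
    total = sum(w * copies for w, (_, copies) in zip(weights, neigh))
    return total / sum(weights)

pred = predict_copy_number(neighbours)
print(f"predicted copy number: {pred:.2f}")   # a continuous value, e.g. ~1.8
```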
Then, when you actually run PICRUSt on your data, there are a couple of steps. The first is that we take your OTU table and correct for 16S copy number, separate from the functional prediction. The idea is that organisms have different 16S copy numbers, some have three copies, some have two, and most of the time people just ignore that. But if an organism has three 16S copies and you sequence your sample, that organism looks three times more abundant than it really is, just because of its copy number. There wasn't really a standard way of handling this (a couple of papers came out around the same time that also attempted to normalize for it), and what we do is estimate the 16S copy number just like any other gene family, and then divide the OTU counts by it. So if an OTU has a 16S copy number of five and a count of five, after correction it really only represents one; that's where the corrected numbers come from. So now you have what we call a copy-number-corrected, or normalized, OTU table, and you can do all your usual analyses with it afterwards. It turns out this doesn't have a huge impact on most analyses; of course, people writing the papers say it does, and it sometimes does. A PCoA plot might not change much, but if you're looking at the abundance of particular organisms, it can matter. So you can take that table and use it however you like.

The next step is where the magic happens: we take the KO copy-number predictions, thousands of KO families predicted per OTU, and do a matrix multiplication. If we have two of a given OTU in a sample, and that OTU is predicted to carry four copies of a particular KO, then we predict that KO appears eight times in the metagenome, contributed by that one OTU. Quick example: if this OTU has four copies of this KO, and the OTU occurs once in sample three, we would expect four of that KO in sample three's predicted metagenome. It's just a matrix multiplication, summing up the contribution of every OTU to every KO. Then you have a predicted metagenome, and you can do what you like with it afterwards. So that's basically how PICRUSt works.
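Here is a minimal sketch of those two steps, dividing the OTU table by predicted 16S copy number and then doing the OTU-by-KO matrix multiplication, with made-up OTU IDs, copy numbers, and counts. It mirrors the worked example above (an OTU with four copies of a KO, present once in a sample, contributes four), but it isn't the PICRUSt code itself.

```python
# Toy version of PICRUSt's core computation. All IDs and numbers are invented.
otu_counts = {                    # raw OTU table: OTU -> {sample: count}
    "OTU_A": {"sample1": 2, "sample3": 1},
    "OTU_B": {"sample1": 5, "sample3": 0},
}
rrna_copies = {"OTU_A": 1.0, "OTU_B": 5.0}        # predicted 16S copy numbers
ko_copies = {                                      # predicted KO copy numbers per OTU
    "OTU_A": {"K00001": 4, "K00002": 0},
    "OTU_B": {"K00001": 1, "K00002": 2},
}

# Step 1: normalize OTU counts by 16S copy number.
norm = {
    otu: {s: c / rrna_copies[otu] for s, c in per_sample.items()}
    for otu, per_sample in otu_counts.items()
}

# Step 2: matrix multiplication -> predicted metagenome (KO x sample).
predicted = {}
for otu, per_sample in norm.items():
    for sample, abundance in per_sample.items():
        for ko, copies in ko_copies[otu].items():
            predicted.setdefault(ko, {}).setdefault(sample, 0.0)
            predicted[ko][sample] += abundance * copies

print(predicted["K00001"]["sample3"])   # OTU_A: 1 x 4 copies = 4.0
```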
I didn't include extra slides about validation and so on; you can read the paper, of course. I didn't want to go into super detail, I just wanted you to finally understand how it works. We basically tested it on human microbiome samples; we also did soil samples, hypersaline mats, and animals, in that case mostly zoo animals. The big question is how accurate PICRUSt is, and the accuracy depends a lot on how well the bugs in your community are represented by reference genomes. If you go sample somewhere exotic where we don't have a great representation of the organisms, PICRUSt won't do as well; it really matters how many relevant genomes we have. But for the Human Microbiome Project samples we were in a good range, and we could actually recapitulate some of the same biological findings obtained from the real metagenomic samples. We did paired samples: we took the 16S data, predicted the metagenome, and compared that against real metagenomic data from the same samples; that's how we developed and evaluated it.

Okay, so that's the bonus lecture on PICRUSt. This last slide is just the workflow, it doesn't really matter: you can get your OTU tables through QIIME with closed-reference OTU picking, since PICRUSt needs a particular input format, a BIOM table, and you can make a PICRUSt-compatible table from a QIIME run. Then, as we've just gone through, you normalize by 16S copy number, predict the metagenome, and you can collapse those KOs into modules and higher-level categories, which is a huge plus for visualization. Okay, any questions on PICRUSt?