John Parkinson; I'm from the Hospital for Sick Children in Toronto, and what I'm going to be presenting today is an analysis of metatranscriptomics: what we're doing with metatranscriptomics, what metatranscriptomics is, and perhaps why some of you might want to consider metatranscriptomics for your analyses. I hope by the end of this talk you'll have a bit more of an appreciation as to why you might want to consider it. I just want to mention again that this talk comes under Creative Commons, so it's going to be available online; feel free to use the presentation and slides as you see fit. So, going over the objectives of this module: this is going to be a presentation of hopefully not too long, then we'll hear from Jack, and then we'll have a practical for about an hour going through some of the pipelines that we've been using for analyzing metatranscriptomic data. The idea here is to really get you understanding what metatranscriptomics can do, and to gain an appreciation of some of the challenges in sample collection and experimental design, but the main focus is going to be on the steps in data processing. The tutorial is based around a relatively simple metatranscriptomic data set; it's only 100,000 reads, whereas normally when we're analyzing these data sets there are on the order of tens to hundreds of millions of reads. As an overview of the presentation: what is metatranscriptomics, how does it relate to RNA-seq, a brief bit on experimental design and sample preparation, but the main focus is going to be on the processing of the reads, as well as the statistical analysis and visualization. All right, so why should we consider metatranscriptomics in our analysis?
So, as Morgan very eloquently outlined yesterday when covering the differences between the technologies we're applying to study microbiomes, we have 16S surveys. These are very informative; they tell us who is there, but they don't really give us much in the way of mechanistic insight. This is a study from a colleague of ours at UC Denver from around 2007-2008, a study of patients with IBD versus healthy individuals, looking at different taxonomic groups in the intestine. You can see that there are huge differences between IBD patients and healthy patients, but you have no idea whether that's cause or consequence. This is a study that Morgan highlighted yesterday, a metagenomic survey. Just to outline: at the top here we have different individuals showing huge variation in their phyla across different body sites. However, when you look at the functions, the functions look relatively similar. The suggestion here is that different taxonomic groups in your microbiome can actually carry out the same functions. So maybe we don't care so much about which organisms are there; maybe we're more interested in what they're doing. And this is what metatranscriptomics is attempting to do: it's really trying to identify who is doing what within your microbiome. So how do we go about doing this? The whole idea behind metatranscriptomics is that we're exploiting RNA-seq technology to determine which genes, which pathways, are actually being actively expressed within the community. So for example, we might have a set of genes here that are involved in cell wall biogenesis; using metatranscriptomics, we might represent them as the nodes in this graph, where each node represents a gene.
The size of these nodes might represent the relative expression of these genes, so you can see within a particular sample which genes are actually active, and then by layering on taxonomic information you might be able to identify which particular taxonomic groups are responsible for those activities. Okay, so I want to go over briefly one vignette that we've been working on recently. This is a study looking at a gene called perilipin 2, or Plin2; in this case it's a mouse gene, and it's involved in lipid uptake in the gut. Previous studies have shown that a knockout of Plin2 can actually modulate microbiome structure, so we were interested in understanding: if you do the knockout, what impact does it have on microbiome function? So we designed an experiment with four mice in each of four groups: we have a Plin2 knockout and wild type, and we're exposing these mice to two different diets, a high fat diet and a low fat diet. To look at the taxonomic groups we recover under each of these diets, we've taken cecal contents and subjected them to metatranscriptomics; we generate about 20 to 30 million reads per sample, and we've used these metatranscriptomic data (so we're not using 16S data here to look at the abundances, we're actually looking at the messenger RNAs) to determine the taxonomic contributions within each of these samples. When we do the statistical comparisons across these samples we find that diet has a significant effect, and you can see shifts in the microbiota associated with diet, as we might expect. However, when we look at the Plin2 high fat and wild type high fat groups, two different genotypes fed the same diet, the taxonomic abundances look very similar: under a high fat diet the Plin2 knockout mice exhibit a similar microbiome to the wild type.
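Statistical comparisons of taxon abundances between groups like these can be done in many ways; one simple, assumption-light option is a permutation test on the difference in group means. A minimal Python sketch (the abundance values and group sizes here are invented for illustration, not our actual data):

```python
import random

def permutation_p(group_a, group_b, n_perm=10000, seed=1):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return (hits + 1) / (n_perm + 1)

# Relative abundance of one taxon in 4 mice per diet group (invented numbers)
low_fat = [0.12, 0.10, 0.14, 0.11]
high_fat = [0.31, 0.28, 0.35, 0.30]
p = permutation_p(low_fat, high_fat)
```

In practice a test like this would be run per taxon, with a multiple-testing correction (e.g. Benjamini-Hochberg) applied across all taxa.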
So this is kind of the reverse of what we found in that metagenomic study, where different individuals can exhibit huge diversity in terms of their microbiomes; here we've got two different genotypes of mice that actually have very similar microbiomes in terms of their structure. So what about function? Under both a low fat and a high fat diet, Plin2 and wild type mice exhibit genotype-specific differential expression of highly expressed genes. When we do the mapping (and I'll explain how we do the mapping of all of these metatranscriptomic data sets), we found we were able to map our reads to around 57,000 transcripts, and across all of these transcripts we found that around 73% of them are shared across all four of these sample groups. When we look at the highly expressed transcripts, however (these are the ones ranked by RPKM, reads per kilobase of transcript per million mapped reads, a way of normalizing expression across different RNA-seq data sets), we find that each of these samples exhibits expression of its own set of genes. And when we do the statistical analysis we find that there are about 1,300 genes which exhibit differential expression across these different sample types. Okay, so what we can do is subject these 1,300 differentially expressed transcripts to gene set enrichment analysis, or pathway enrichment analysis, and we identified 22 out of 180 metabolic pathways that were enriched for these differentially expressed transcripts between the Plin2 and wild type mice fed a high fat diet.
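A result like "22 out of 180 pathways enriched" typically comes from testing each pathway for over-representation of differentially expressed transcripts, often with a hypergeometric test. A minimal sketch (all counts invented for illustration):

```python
from math import comb

def hypergeom_enrichment_p(total, in_pathway, de_total, de_in_pathway):
    """Upper-tail hypergeometric probability: chance of seeing at least
    de_in_pathway pathway members among de_total differentially expressed
    transcripts, drawn from total transcripts of which in_pathway belong
    to the pathway."""
    return sum(
        comb(in_pathway, i) * comb(total - in_pathway, de_total - i)
        for i in range(de_in_pathway, min(in_pathway, de_total) + 1)
    ) / comb(total, de_total)

# Toy numbers: 1000 transcripts, 50 in the pathway, 100 DE, 15 of those in the pathway
p = hypergeom_enrichment_p(1000, 50, 100, 15)
```

Each pathway gets its own test, so again a multiple-testing correction is applied before declaring some subset of the 180 pathways enriched.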
So remember, these are two groups of mice that were fed the same diet and didn't exhibit any differences in their taxonomic composition, but when we look at the functions, they're actually expressing different pathways. The idea here is that the Plin2 knockout causes an accumulation of triglycerides in the gut, and that somehow impacts these metabolic pathways; a lot of them are associated with amino acid biosynthesis and energy metabolism, and it's causing changes, redirections in the flux of the pathways, to account for this additional quantity of triglycerides in the gut. All right, so again, this has been to show you why we think it's important to consider metatranscriptomics: these kinds of functional differences you wouldn't have observed using just 16S analysis. All right, so how do we go about doing metatranscriptomics? The basic process is through RNA-seq, a technology which has been around a number of years now. RNA-seq is the unbiased sequencing of an RNA sample to yield, unlike microarrays, a digital readout of the relative expression of transcripts in the sample. So here we have a mouse; we extract the RNA, fragment these RNAs, sequence them, and then align the reads to known transcripts to get a readout of the relative expression. Now, typically RNA-seq is applied to organisms where you have a reference genome, so this mapping exercise is really easy; however, as we know, microbiomes are a lot more complex, and so microbiome applications face a number of additional challenges that traditional RNA-seq methods don't have. Are there any questions on this first part of the talk so far? Yes? Excellent point; I'll cover this in the next couple of slides. In a typical RNA-seq experiment applied to a eukaryotic organism, you can enrich for the messenger RNA because the transcripts have these poly-A tails.
Bacteria don't have these poly-A tails, and as a consequence ribosomal RNAs tend to be present at significantly larger abundances than messenger RNA, so this is a huge problem. In addition, as you've suggested, if you have host contamination as well, then this is also an issue. We did a stool sample from an IBD patient a couple of years ago and found that about 95% of the sequence reads were human derived and only 5% were actually coming from bacteria, so host contamination can be a significant problem. The other issue we're facing is that environmental samples really lack reference genomes, and this makes it incredibly challenging to map reads back to their source transcripts. (Comment from Jack: in the library preparations as well, there are now kits where you can remove unwanted RNA species.) Next slide, please; thanks for that, though. Oh sorry, the slide after this one. Thank you. All right, so this is what a typical metatranscriptomic analytical pipeline might look like, pretty similar to a metagenomics pipeline. Here we have our mouse; we obtain the RNA and prepare it for sequencing, and this is the step that Jack was alluding to, where we might want to enrich for certain varieties of RNA, the messenger RNAs. We generate the reads, remove the low quality ones (so you have somewhat fewer reads than you started off with here), remove the ribosomal RNAs (even fewer reads), remove host reads, and hopefully you've got something left at the end. Then you can assemble, identify the bacterial transcripts the reads originate from by doing this mapping, and once you've got those transcripts, start mapping them into pathways to actually understand what kinds of biological systems are being upregulated or differentially regulated in your particular samples.
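The chain of filtering steps just described is easiest to manage as a modular pipeline, where each stage can be swapped independently. A toy Python sketch (the stage functions here are invented stand-ins for real tools like a quality trimmer, an rRNA remover, and a host-read filter):

```python
from typing import Callable, List

# Each stage maps a list of reads to a (usually smaller) list of reads,
# so any stage can be replaced without touching the rest of the pipeline.
Stage = Callable[[List[str]], List[str]]

def run_pipeline(reads: List[str], stages: List[Stage]) -> List[str]:
    for stage in stages:
        reads = stage(reads)
    return reads

# Toy stand-ins; real stages would wrap external tools.
def quality_filter(reads): return [r for r in reads if len(r) >= 50]
def remove_rrna(reads): return [r for r in reads if not r.startswith("rRNA:")]
def remove_host(reads): return [r for r in reads if not r.startswith("host:")]

reads = ["rRNA:" + "A" * 60, "host:" + "C" * 60, "G" * 70, "T" * 20]
mrna = run_pipeline(reads, [quality_filter, remove_rrna, remove_host])
# Only the 70-base putative mRNA read survives all three filters
```

Structuring the steps this way is what lets you swap, say, one rRNA detector for a more sensitive one later without rewriting anything else.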
All right, so one of the problems with sample collection and RNA extraction is that RNA, unlike DNA, deteriorates rapidly, so the method of storage and preparation can really impact how much you can recover and even which taxa you can recover. And given that there are only three of you who are actually doing any metatranscriptomic work at the moment, ish, it's probably not surprising that the methods for extracting, processing, and storing microbiome samples for RNA extraction really haven't been standardised yet, so that's still a work in progress. We found that the best approach, obviously, is to process immediately and ideally to sequence immediately, but at least if you can extract and purify the RNA and store it at minus 80, then it tends to be relatively stable. The next best is to snap freeze in liquid nitrogen and again store at minus 80, but the longer you store it, the more you start seeing deterioration in the quality of the RNA. I know some people are interested in using RNAlater, but we suggest avoiding it: to maintain the integrity of the samples, we find that it can lyse some cells and it can interfere with some of the RNA extraction kits that we use, so in general we don't tend to use RNAlater. Metatranscriptomics is not cheap. I think it was mentioned yesterday that we're down to about 30 dollars a sample for 16S; here we're talking around 300 to 400 dollars a sample. The cost isn't necessarily the sequencing; the cost is actually in the library preparation, in the kits that you have to use in order to generate the sequencing libraries. These tend to be quite expensive.
So this raises some interesting questions as to how many biological replicates you need. We suggest at least two; I think we're trying to do at least four in most of our analyses at the moment. These can be very challenging, especially when considering human samples, where our ability to get biological replicates is really limited, and power analyses are also incredibly challenging. I think Rob alluded to a paper yesterday which was looking at power analyses in 16S data sets, published in 2016; we haven't really got very far at all in terms of understanding how to do power analyses for metatranscriptomics, so these are some of the challenges we're facing with metatranscriptomics for the moment.

Okay, so on to the sample preparation. As I mentioned, bacterial messenger RNAs lack a poly-A tail, and ribosomal RNA species tend to be highly abundant: they will generally represent 95 to 99 percent of all of your RNA sequences. There are a number of kits now available to remove some of these abundant ribosomal RNA species, and they seem to be improving over time. The one we're recommending is the Ribo-Zero Gold kit, available from Illumina. This is their data, and they're claiming a huge amount of success in terms of how much they can deplete. We originally started with a RiboMinus kit in 2012, and I think we recovered about 25 percent of the reads as messenger RNA and about 75 percent ribosomal RNA, so it reduced it a bit, but not much. A couple of years later we tried the Ribo-Zero kit, I think it was the first generation, and we were able to increase that to about 50 percent messenger RNA and 50 percent ribosomal RNA. With the latest one, in the data I showed you right at the beginning from those 16 different mice, we ended up with about 70 or 80 percent messenger RNA, so it seems that the kits are improving and doing a pretty good job of removing these abundant ribosomal RNA species. And as we mentioned, host messenger RNAs can also prove challenging; at the same time, you could spin this around: this is actually going to give us information on what the host is expressing, so there's a potential, when you're analyzing your data set, that you could be analyzing these host mRNAs to see what the host is doing in response to your microbiome.

All right, generating reads: how many reads do we need to generate, how much is enough? This is from an analysis we did a couple of years ago, where we took four published metatranscriptomic data sets, did a rarefaction analysis, and counted the number of enzymes, okay, not the number of transcripts but the number of enzymes associated with a sample. This reflects the idea that perhaps we're not so much interested in the specific transcripts or the specific organisms these transcripts are coming from; maybe it's the function they're contributing. So we can do these rarefaction analyses on the different functions, in this case enzyme classifications, and we see we kind of saturate at around five million or so reads. So, given the loss of reads as they're processed through the various sequencing pipelines, we're thinking maybe about 20 million reads per sample is sufficient; that's our current recommendation, though it might go up as we learn more about some of the problems with the statistics of analyzing metatranscriptomic data sets.

How do we analyze the data? Metatranscriptomics is very much in its infancy: it's a relatively new field, with relatively few robust software standards and methods out there, so these still need to be developed, and new tools are continuously being created. The concept here is that we can have some kind of pipeline, but it should be relatively modular, so that you can swap different pieces of software in and out as they show continuing improvement. So
this is our pipeline from a couple of years ago. We take the raw reads and there's a pre-processing step: we take the raw sequence reads and get rid of the low quality ones, the ribosomal RNAs, the host transcripts, adapters and so forth. There's then an assembly step and an annotation step to tell us what these reads actually are (new assembly methods and new annotation methods keep coming out), and finally we have some groups of analyses that we might want to do, and again these analysis methods are going to keep changing. A lot of these methods, particularly around the processing at the top, can use existing tools that have been applied to analyzing any kind of high-throughput sequencing data. Here are a couple of pipelines that have been published recently, last year in fact. SAMSA: again there's a pre-processing step and an annotation step (they don't actually do an assembly step), then they aggregate some of the results from the annotation, and there's a kind of analysis step afterwards. Then there's one called IMP; this is interesting because it combines metagenomics with metatranscriptomics. The idea here is that rather than doing metatranscriptomics in isolation, you do metagenomics at the same time, create a whole bunch of contigs as an assembly, and use those to map your metatranscriptomic reads onto, which might make it easier to do the annotation downstream. Again, they have a pre-processing step, an assembly step, and then annotation and analysis steps. So these are the kinds of frameworks we're working with for developing these pipelines.

So how do we do the pre-processing? The idea is that from all the reads generated by our sequencing machine, we want to identify those reads that are derived from messenger RNAs. There are a number of contaminants that you get: low quality data, adapters that need to be trimmed off, host data that also needs to be removed, and ribosomal RNA. These have already been covered to some extent when we were analyzing the metagenomics data, so I won't go over them; just one thing to point out is this Infernal process to identify and remove the ribosomal RNAs. This is a real pain, because it's probably the rate-limiting step; unfortunately we haven't found a tool yet which has the same sensitivity as Infernal, but it's a very slow step. So whereas a lot of these steps can be run on a relatively modest workstation, we find this Infernal step is so slow that you really do need to think about putting your analyses onto a supercomputer. Once we have removed all of these contaminants and we're left with what we think are putative messenger RNAs, we then do an assembly step, and the reason is that we found assembly improves annotation accuracy: if you can assemble these relatively short reads into longer contigs, your ability to annotate them rises quite dramatically. That's why we're proposing that you do an assembly step. We looked at a number of different assemblers a few years ago and found that Trinity gave the best performance, with the highest proportion of reads that could be annotated; however, new tools are coming out all the time, as I suggested, and we've now swapped Trinity for SPAdes. So again, just remember when you're developing these pipelines to think in terms of modular design, so you can always swap in better-performing algorithms as they're developed. One issue that could arise, though we haven't had too many problems with it, is chimeras. These are misassembled contigs, and they can be particularly problematic where reads derive from orthologs of different species. We haven't actually found this to be too significant a problem in metatranscriptomics; we find that anything from about one to three percent of our contigs might represent some kind of chimera, so it's relatively low, and there are tools such as UCHIME that you might be able to apply to identify and correct some of these chimeras.

Okay, so we've got our assembled contigs, and we've also got our unassembled reads that we couldn't assign to contigs. How do we annotate these? How do we ascribe some kind of functional annotation to what can be relatively short reads? It's a little bit depressing that, despite the fact that BLAST was created in about 1990, we're still reliant on BLAST or BLAST-like sequence similarity tools for doing this annotation. We do adopt tools such as BWA and BLAT, which are relatively fast; however, the problem with these tools is that they rely on near-perfect matches, and as we know, particularly in environmental samples, you can get a lot of diversity at the nucleotide level, which means you just can't get these sequence-based matches. The other problem we find is that even when we're sequencing a different strain of a species that may be well characterized, we can identify a whole set of new genes we haven't seen before. There's a nice study from 2007-2008 which was looking at the genes associated with various strains of Streptococcus agalactiae, and on these rarefaction plots they find that as you sequence additional genomes of Streptococcus agalactiae, you keep getting new genes that you haven't discovered before. As we know, this is a common feature of pretty much any species out there; this is the pan-genome concept, where each strain samples from different portions of this global pan-genome, and so when we're doing sequence similarity searches against reference databases, we just might not have all the genes associated with this pan-genome.
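One reason these nucleotide-level searches struggle at the strain level is codon degeneracy: sequences can diverge considerably at the DNA level while still encoding an identical protein. A toy Python illustration (the codon table here is just the relevant fragment of the standard genetic code):

```python
# Fragment of the standard genetic code, enough for this demo
CODONS = {"TTA": "L", "CTG": "L", "GCT": "A", "GCC": "A", "AAA": "K", "AAG": "K"}

def translate(seq):
    """Translate an in-frame nucleotide sequence into amino acids."""
    return "".join(CODONS[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

# Two strain variants: 4 of the 9 nucleotides differ, yet the peptide is
# identical, so a protein-level (BLASTX-like) search still finds the match
strain_a = "TTAGCTAAA"
strain_b = "CTGGCCAAG"
assert strain_a != strain_b
assert translate(strain_a) == translate(strain_b) == "LAK"
```

This is exactly why the searches described next move into peptide space.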
So one solution to get around all this nucleotide diversity that we see at the strain level is to move into peptide space and use BLASTX-like searches against protein databases. We were using BLAST up to about two years ago; it's slow, it's time consuming, but fortunately now we have DIAMOND. It does pretty much the same job as BLAST; I remember when we did benchmarking, around 95 percent of the hits we got with DIAMOND were exactly the same as BLAST, so it's basically doing a similar job to BLAST but much, much faster; I think somebody mentioned it was about 40 or 100 times faster. However, even with these BLASTX-like searches, we still get large proportions of reads that are unannotated. These are five different sets of metatranscriptomes, and this depressingly large blue bar here is the proportion of reads that could not be mapped by BLAST, BLAT, or BWA. We're hoping that as sequencing platforms deliver longer and longer read lengths, these kinds of issues might start going away, but for now, identifying a match within our data sets can remain problematic. Right, so are there any questions so far on this part of the talk? Okay, great.

So this is a tiered set of searches: we use BWA and BLAT against a set of about 2,000 microbial genomes, and then BLAST is performed against the protein NR database. We're hugely reliant on BLAST, because BWA and BLAT really rely on almost exact matches, which is great if reference genomes are available for the sequences in your metatranscriptome data set, but often they're not, so we are really reliant on these BLASTX matches. This is a typical match for a 71 base read (we're now up to about 100 or 150 bases as a standard for metatranscriptomics, but this is a typical BLAST report): we get an E-value of 39, and that's not 10 to the minus 39, it's actually 39, so this really wouldn't be considered a statistically significant match. But we look at the species, and that looks about right, and when we look at the summary of the matches we find there's a large proportion where the reads match at very high identity across the length of the particular match. So rather than using these E-values, we prefer to use cutoffs based on read length as well as the percentage identity of the match. Okay, any questions on that?

All right, so again, as I mentioned from that metagenomics study, maybe we're not so interested in knowing which particular phyla, which particular transcripts, might be present in a particular microbiome; maybe we're just interested in the function. So the way we're thinking about this annotation strategy is first of all to map to a known transcript, but then, because that transcript could be a spurious match (we know that there could be about 10 different BLAST matches that might fit your sequence read, and you're just taking the top one), we map the known transcript to a more general function; it could be an enzyme. The idea here is that even if your sequence read could be matched to 10 different transcripts with the same probability, each one of those transcripts is likely to have the same function, for example the same enzyme function. Because we're not particularly interested in the specific transcript, we're more interested in the specific enzyme function, we do this mapping: we map initially to a known transcript, and then we use that transcript to map to the more general function. These bar graphs here represent the number of reads sequenced for 16 different samples, and the putative messenger RNAs associated with those reads; there's a high correlation between the two. The unique transcripts associated with each of these are relatively invariant (there seems to be a problem with this particular sample here); however, when we map to the unique enzymes, then we get a much more
standardized kind of sample, so you can eliminate a lot of the variation if you start mapping into these more general functional categories. All right, so one thing we have with these messenger RNA tags is the ability to measure the relative expression of these transcripts, of these enzymes, in a particular sample, but it has to be normalized to some extent: we need to account for differences in gene length, because longer genes are likely to give you more reads just by virtue of being longer; by random sampling you're more likely to sample from a longer transcript. So you have this reads per kilobase of transcript per million mapped reads, RPKM. This is a transformation where you convert your raw read counts, based on the length of the sequence, into these RPKM values; there's a bit of a calculation behind it. This is the way you can normalize, standardize, the expression based on the length of the actual transcript, and there are several software tools available to do this mapping and calculate these normalized expression profiles; it's a fairly standard part of these pipelines. Something to be aware of, though, is that you do need to normalize your read counts into some kind of expression value to account for the fact that transcripts can be relatively long. Okay, so that covers... yes, sorry? Yes, and we're looking into that; that's a great idea that hasn't been put into any of these pipelines yet, but it's certainly a direction being explored at the moment: which housekeeping genes should we be focusing on in order to look at relative abundances within these samples? Yes, great point. I mean, if you take the EC, you know, the Enzyme Commission classifications, some of those enzymes are usually well conserved, and in some ways those could possibly serve as housekeeping genes in bacteria. Absolutely; I think there was a paper that came out last year which predicted, I think, 20 or so housekeeping genes (or it might have been 102, in that ballpark) based on analyses across all these different bacterial genomes. These would include things like a lot of ribosomal proteins and gyrases and so forth, so we are starting to look at some of those genes and the ability to normalize based on them.

Yes? So the question is: we're extracting RNA from a sample with this large mix of bacteria present, so how do you differentiate at the end which bacterium a read came from? That's exactly what I'm going to talk about in the next part: out of all these functions, who is actually contributing to them? I'll be discussing that next. Okay, yes, a follow-up on the housekeeping gene point: the questioner doesn't do much transcriptomic work right now, but their master's project measured gene expression by real-time PCR using a housekeeping gene, and they found housekeeping genes to be more of a problem than a plus, so why would we be interested in using them? So, one of the big problems with a lot of these analyses, whether you're doing 16S or metagenomics or even RNA-seq, is that you're just getting count data out rather than absolute quantification. If you want to quantify levels within your sample, rather than just relative levels, then given that these housekeeping genes should be relatively stable across particular taxa, you can start asking whether the changes you're seeing in particular pathways are real or not. Given what we expect the copy number of a particular gene to be, if suddenly our metabolic pathway has gone sky high, is that real? Or are all
of the housekeeping genes associated with that particular organism with that particular pathway also in increasing that amount so it's a way of trying to get at this way of normalizing across was there yes it's also connected to the whole housekeeping gene thing with respect to the problem with metabolic genes is that microbiome response to food a lot and a lot of conditions it becomes a question of why rewiring the metabolism and the energy so actually we end up being one of the most very variable genes across the different genes that are available so maybe something more like a DNA synthesis gene that's because they need to replicate things to divide and grow but yeah there's a just a caution yep absolutely and it would depend on the actual type of metabolic enzymes and I imagine that certain metabolic enzymes are going to be less prone to that which again it's going to be related with active kinetics so in the meantime it's working very fast anyway it's going to be a complicated response to a particular condition because it's really not the sort of method that we're talking about so these kinds of rich of these 120 or so genes that have been found widely concerned across all these new bacteria which of these might actually represent to be able to do that across all these new okay so going back to the original question on taxonomic annotations so we've assigned hopefully RNA reads to different functions but we might want to know which texts are actually responsible for these functions so that we can identify the keystone taxa another idea is that can also help with binning for assembly so this is something I haven't really mentioned I'm not going to go into because I don't think it's widely applied for metatranscriptomics yet but the idea is that you might want to take your sequence reads bin them into different taxonomic groups and then assemble the ones you've done the binning okay so processes such as taxonomic annotation of your reads could help with that 
binning step prior to assembly. We could use alignment-based methods such as BLAST and BWA, but we know these can fail where we lack suitable reference genomes, and they don't tend to be very accurate, so there's been a lot more interest in compositional methods. These are based on k-mer frequencies: we have a set of k-mers and a frequency profile over those k-mers, and then we can use methods such as nearest neighbours to assign a sequence to the genome with which it shares the closest distribution of k-mers. There are a number of tools that use these kinds of approaches, such as CLARK. NBC is probably the most successful one at the moment — it's a little bit clunky, but I don't think there's a piece of software that's actually been shown to outperform NBC yet; it's very slow, though. Then there's Kraken, which was mentioned yesterday. And there are a couple more that we've been looking at recently, just published last year. One is Kaiju, and this is the one we're using in the tutorial. Kaiju is relatively fast; rather than working at the nucleotide level it works at the protein level, and it uses the Burrows-Wheeler transform. What it does is a six-frame translation of your sequence read — some positions will represent stop codons — then it sorts the translated fragments by length and tries to find maximum exact matches to known sequences, working in descending order of length to find the largest match it can get, and it assigns the read to the taxon with the largest match. Because it's working at the protein level, it can accommodate sequencing errors or subtle nucleotide differences. The problem with Kaiju is that it has relatively large memory requirements, which is a bit of an issue — I think it might require something like a 100-gigabyte machine to run on. On the plus side, it does have a really nice GUI at the end, which you'll
see in the tutorial, which enables you to explore the taxonomic distribution of your data set. So that's one that we've started exploring. Another one is Centrifuge, which was also published late last year. This again uses a Burrows-Wheeler transform; it's very fast, it can accommodate sequencing errors, and it improves on the existing k-mer approaches. Kraken was, I think, the precursor to Centrifuge; Kraken requires up to about 100 gigabytes, whereas Centrifuge requires much less memory. The reason it can use much less memory is a rather clever compression algorithm. It takes genomes from the same species and compares them, identifies where two genomes are exactly the same, and adds only the novel parts of each genome to the database; then it compares the next genome, finds the differences between that genome and the combination so far, adds those in, and so forth. This can significantly compress your database, so you end up with a smaller database to compare against, and as a consequence you don't need as much memory to run the searches. I think Centrifuge can be used on an 8-gigabyte RAM machine, so it's actually possible to run on a desktop. The other interesting feature is that it assigns reads to multiple taxa: rather than reporting one best hit, it can say that a read could be any of several taxa, and you can either force it to take the top hit or report the full distribution. So again, this is an interesting feature that you may or may not want to use. In terms of performance, I could show you the benchmarks from their papers on their own data sets, but it seems to be a curious thing that whenever somebody produces a new software tool and benchmarks it, it always does better than its competitors — potentially because they use their own data sets.
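As an aside, the compositional classification idea described earlier — assign a read to the reference whose k-mer frequency profile it most resembles — can be sketched in a few lines. This is a toy illustration only (tiny made-up "genomes", an L1 distance, k = 4), not what CLARK, Kraken, Kaiju or Centrifuge actually implement:

```python
from collections import Counter

def kmer_profile(seq, k=4):
    """Normalized k-mer frequency profile of a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def distance(p, q):
    """L1 distance between two sparse k-mer profiles."""
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in set(p) | set(q))

def classify(read, references, k=4):
    """Assign a read to the reference with the closest k-mer distribution."""
    profile = kmer_profile(read, k)
    return min(references, key=lambda name: distance(profile, references[name]))

# Toy reference "genomes" with different compositional biases (hypothetical):
refs = {
    "gc_rich_taxon": kmer_profile("GCGCGGCCGCGCGGGCCCGCGGCGCCGCGC" * 10),
    "at_rich_taxon": kmer_profile("ATATTAATTAAATTTATATAATATTTAATA" * 10),
}
print(classify("GCGGCCGCGCGGCCGC", refs))  # assigns to the GC-rich reference
```

Real tools index millions of k-mers in exact-match data structures (or, in the case of Centrifuge and Kaiju, an FM-index built on a Burrows-Wheeler transform) rather than comparing dense profiles read by read, which is where the speed and memory trade-offs discussed above come from.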
So I'm not showing the benchmarking from their papers here; I'm showing a comparison against one of our own mouse gut microbiomes. DIAMOND is our standard tool here, so this is a breakdown of taxa from these metatranscriptomic reads according to DIAMOND, then Kaiju, then Centrifuge. You can see that Kaiju and DIAMOND are pretty similar, and that's perhaps not so surprising, because Kaiju is in effect doing a BLASTX-type algorithm: it's translating everything into peptide space and then doing comparisons at the protein level, which is what we're using DIAMOND for. Centrifuge, on the other hand, gives you quite different results. We've got a set of reads down here which are, I think, Firmicutes, and these are greatly reduced relative to the other two tools. When we look at the percentage of reads annotated, we see that Centrifuge does a much poorer job of annotating reads than Kaiju, which itself is not as good as DIAMOND. So potentially the reason we see fewer reads assigned to Firmicutes here is that Centrifuge isn't able to annotate some of those reads at all. This could be a problem with Centrifuge: it may be biased towards identifying certain taxonomic groups. So we're still exploring how best to integrate some of these tools to give us a more robust platform for taxonomic annotation of metatranscriptomic data sets. Any questions on taxonomic annotation? Good. Just one thing to bear in mind: the taxonomic annotations that we get out of metatranscriptomic data don't necessarily correlate with the taxonomic annotations from 16S data, and that perhaps shouldn't be so surprising. First of all, you could get artifacts due to biases in the ribosomal RNA or mRNA sequencing and processing steps, and that's obviously going to confound the comparison. So here we've got messenger RNA, ribosomal RNA, messenger RNA, ribosomal RNA — these are different pairs. What's
happened here? Those were different pairs of samples being analyzed, and pretty much across the board all of the messenger RNA profiles look very similar to each other and very distinct from the ribosomal RNA profiles. Yes?

There's a certain percentage of reads that are not annotated, but what tells you which tool is the best one? A read can be annotated, but with a wrong annotation — so how do you know which annotation is good when you compare, say, Kaiju and BLAST?

There are a number of simulated data sets that you can produce, and that's what they do their benchmarking on: they create these simulated data sets and then compare how well each tool does on them. We actually had a mouse data set where we think we know what the taxa are. This was a metatranscriptome from mice inoculated with something called altered Schaedler flora, which is supposed to contain about eight or ten different taxa, and the genomes were published, I think, two to three years ago, so we can use that as a kind of gold standard in a real metatranscriptomic data set. We've done some benchmarking on that, and as I mentioned, NBC — the naive Bayes classifier, which was actually produced back in 2007; the 2011 reference is a web-based form of it — actually performs the best, and it seems to be incredibly difficult to out-compete that NBC platform. Rob?

So abundance is not the same as expression, and that leads to one of those operating truths of biology: biology matters. mRNA degrades at different rates depending on its sequence and structure — how much of a problem is that across different sorts of mRNA?

So we went back to our stool sample and tried to analyze this by looking at 16 different ways of preparing and storing the RNA. We stored it at four degrees and, I think, minus 20,
and then we either processed immediately or left it for a week, and we either added RNAlater or we didn't. Unfortunately, when we performed that experiment we ended up with 95% host reads, because the sample we got was from an individual who was suffering from IBD. What was interesting to us was that this seemed to be an incredibly high amount of host mRNA to be getting out of a stool sample, given what other people have been reporting. It turns out that colleagues in Ottawa had found similar results in stool samples from IBD patients: there seems to be a large amount of host material in their stool relative to bacterial material. So we weren't able to follow through and see what the impact of degradation was, which was exactly what we wanted to do.

That's actually my question. mRNAs from different genes have different half-lives, right? The glucokinase gene, say, might produce an mRNA that lasts 15 minutes on average, whereas another glycolytic enzyme's mRNA can last much longer, and that's an important part of the regulation. But I don't know the story in bacteria — shouldn't something be done about that?

So potentially that's why people are moving on to metaproteomics: to get around that whole relationship between the number of messenger RNAs you measure and the proteins that are actually translated. You're absolutely right — there are a number of confounding factors, and the gene expression profiles that you get out aren't necessarily reflective of actual protein abundance. But Jack's probably going to correct me on this.

No, I'm not correcting you — I just think you're pushing the problem forward. If you have a protein, you don't know if it's modified in a way that affects the phenotype, so you have to pay a lot of attention to that as well. I mean, it's not very satisfying, not letting you
do anything about it — it's just a lot of work. And obviously proteins turn over at different rates too, so that's part of the same problem.

Has anybody here looked at what protocols they're using for RNA storage and processing?

Well, DNA Genotek has a stool sample processing kit where they state that RNA is stable in it, and they have some data in a paper from a group that's independent of the company. But you know, in the old days we used another kit, and it was kind of the only one that seemed to work.

So the one issue I'd have with that is: how do you know what kind of biases you're going to get, unless you start comparing and doing that benchmarking? Again, it's potentially the same as with 16S — we really need more rigorous benchmarking of all these different kits across standard samples, so that we can say what the best ways of handling these kinds of samples are. And back to Morgan's point about why we don't just snap-freeze these things: we find that even when we store these samples at minus 80, by the time we get around to sequencing them, even six months or a year later, you can get degradation. So there seem to be issues with any long-term storage of these particular molecules. It doesn't make any sense to me — if you put something at minus 80, why is it still deteriorating? — but that seems to be our experience at the moment. Was there another question? Yes?

So the idea of metatranscriptomics is to look at the activity of the bacteria, as opposed to just which bacteria might be growing in a given metabolic environment. You said that mRNA abundance is not proportional to ribosomal RNA abundance, and that's a good thing — initially it seems surprising, but we want it to be that way.

Oh, I do apologize if I suggested that that was a bad
thing. Again, absolutely: abundance is not the same as expression — you're exactly right. We do actually want to see what the expression is, and we're not expecting it to match the ribosomal RNA profiles. Sorry if I was a little confusing there. Yes?

I would assume that when you do the processing of the RNA there's a normalization step at some point — maybe when you build the library?

Yes, so this is where the RPKM value comes in: you're normalizing for how many reads have been generated in total for that particular sample, and you're also normalizing for gene length. I think that's the only normalization step that actually goes into this analysis.

Okay, but how can you account for the degradation?

We can't. And what was the other part?

This goes back to the RNA storage. I just pulled out my lab's protocol, and for stool samples we do keep them at minus 80, and we use RNAprotect. Have you had any luck, or bad luck, with that?

Again, I wouldn't be tempted to use it, but I'd love to hear what the differences are between adding it and not adding it, in terms of the types of reads you get out at the end — whether it actually makes any difference. It's one of those studies that needs to be done. And I don't know if this was mentioned yesterday — the 16S study where they looked at all these different sample kits and all these different pipelines? I think it's about to be published in Nature Biotech; it's about three years old, so it's all a bit dated because everything's moved on since then, but these are the kinds of studies that we need: this kind of benchmarking, using these kinds of standards, to see what works and what doesn't. Any further questions or comments? All right, good. So: we've generated our messenger RNA, we've sequenced it, we've annotated it, we've
assembled it, we've done some functional annotation and some taxonomic annotation — and now how do we actually visualize these results? We could start looking at the taxonomy within our data set. There's this nice tool called Krona, and Kaiju, as I mentioned, comes with a script that converts its output into a format usable by Krona. This is a tool you'll be using during the tutorial: you can click and select different groups and it expands, and it's quite cool. I'm not sure you'd actually present anything like this in your paper, but this is the payoff — you spend all this time generating these data sets, you want to look at them, and this is a nice reward at the end. Here's something interesting I came across recently: VizBin. This performs a dimensionality reduction, a bit like a PCA, to generate clusters of reads based on their k-mer distributions, and the idea is that each cluster might represent a set of reads coming from the same organism. So this could help, for example, to guide binning prior to assembly, and it gives you a kind of overview of what your sample looks like in terms of its k-mer distributions — a cool tool. In terms of function: I don't know how many of you read these metagenomic papers; it's not quite so bad now, but maybe three or four years ago most of the functional analyses you got back were either just lists or groups of KEGG pathways that were up- or down-regulated, or very simple bar charts based on things like Gene Ontology or COG annotations. These don't tend to be very informative — I don't think they're particularly interesting — and part of the problem is that the categories are very broad. So it's been more interesting to start placing genes into the
context of the systems in which they're operating. We know that genes don't operate in isolation; they're part of highly complex pathways or complexes performing some common function. So we can think about placing our data sets in the context of, for example, protein complexes, metabolic pathways, or even signaling networks, and if we can start placing our metatranscriptomic reads in these contexts, we might get a better idea of which functions are really being coordinately up-regulated or down-regulated in these particular samples. These were two tools mentioned yesterday, MG-RAST and MEGAN, which picture these kinds of KEGG-like pathways that you can layer your metagenomic or metatranscriptomic data sets onto. Metabolic pathways are very useful because they're highly conserved and we know a lot about them — we know what the organization of the pathways should look like, thanks very much to KEGG — so they give us quite an intuitive way of understanding how reads, transcripts and expression map in the context of these pathways. One issue I have with these, though, is that they present pathways in isolation: they don't show connections across and between pathways. We know that some of these enzymes produce substrates that might be utilized by another pathway somewhere else, and you can't see that through these simple pathway-based analyses. So we prefer to take more of a network-based approach. Part of the reason for this is that if we go back to these KEGG-defined metabolic pathways, the way they were annotated and curated in the first place was largely based on E. coli, yeast or human data, and we're dealing with microbiomes containing all these different bacteria, which can have all these other pathways that may not be captured by those standardized KEGG pathways. So we
generate these kinds of global metabolic networks, if you like, and then you can start layering on your metatranscriptomic data. Here we can combine taxonomic and functional annotations: these are the different enzymes that we've identified in a particular metatranscriptome, and the pie charts represent the breakdown — the contribution — of each of the different taxa providing that particular function within that particular pathway. So this really gives us an idea of which genes, which enzymes, which pathways have been up-regulated in our sample, and which taxa are actually responsible for that up-regulation. Another type of network we can use is the protein interaction network. These offer additional scaffolds that we can use to interpret our metatranscriptomic, and potentially metagenomic, data sets. This is a protein interaction network of genes involved in cell wall biogenesis, mapped with contributions from a mouse gut microbiome. We can see that for these three genes there's a large contribution from this purple taxon, which I think is Bacteroides — so maybe Bacteroides is producing a lot of these cell wall biogenesis transcripts, with the levels of peptidoglycan associated with those particular enzymes. One issue with these kinds of data sets is that they rely on somebody having generated the protein interaction networks in the first place, and these are pretty limited to only a few taxa — at least the high-confidence ones are. There are functional networks you can get from the STRING database, which covers a whole bunch of different taxa, but they're not particularly high quality; the high-quality networks are really limited to a few taxa such as E. coli. So now you're reliant on homology mapping: you have to map your transcripts from organism X onto an E. coli gene in order to be able to create these kinds of views.
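The enzyme-by-taxon breakdown behind those pie charts comes down to a simple aggregation: each read carries a functional annotation and a taxonomic annotation, and you tally each taxon's fractional contribution to each function. A minimal sketch, with made-up annotations for illustration:

```python
from collections import defaultdict

def taxon_contributions(annotated_reads):
    """annotated_reads: iterable of (function, taxon) pairs, one per read.
    Returns {function: {taxon: fraction of that function's reads}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for function, taxon in annotated_reads:
        counts[function][taxon] += 1
    result = {}
    for function, by_taxon in counts.items():
        total = sum(by_taxon.values())
        result[function] = {t: n / total for t, n in by_taxon.items()}
    return result

# Hypothetical per-read annotations (function, taxon):
reads = [
    ("glucokinase", "Bacteroides"), ("glucokinase", "Bacteroides"),
    ("glucokinase", "Firmicutes"), ("enolase", "Firmicutes"),
]
print(taxon_contributions(reads))
# glucokinase: 2/3 Bacteroides, 1/3 Firmicutes; enolase: all Firmicutes
```

Each inner dictionary is exactly one pie chart: the slice sizes for that enzyme in the network view.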
Okay, so that's visualization — what about statistical considerations? Again, metatranscriptomics lags behind metagenomics, which lags behind 16S: there's really no dedicated statistical software for comparing metatranscriptomic data sets, so we're reliant on existing tools built for RNA-seq data. On the number of biological replicates: as I mentioned, we need at least two, preferably at least three, but these are expensive experiments, so cost really reduces people's options in terms of how many replicates they can afford and what questions they can ask. Power analyses — again, we have no good idea how to do those at the moment. One thing we can do is apply these RNA-seq-style tools, DESeq2 and ALDEx2, to identify differential expression of individual genes, and once we've identified differentially expressed genes, we can do gene set enrichment analyses. But ultimately, given the problems we have with the statistics and the lack of power in these data sets — the lack of replicates — I think we should really be viewing these metatranscriptomics experiments more as hypothesis generation, to be followed up, as someone mentioned earlier, with qPCR-type experiments across a larger number of biological replicates to verify that what you're seeing really is true. Just to show you the kinds of analyses you can do: you can do PCA plots. This is for the 16 mouse guts I mentioned earlier — Plin2 knockout versus wild type, high-fat versus low-fat diet. In the taxa plot there's no differentiation between the green and the red points, but here we see significant differences between the green and the red, where the red are the Plin2 mice on the high-fat diet and the green are the wild-type mice on the high-fat diet. So at the transcript level we see a significant shift between Plin2 and wild-type mice on the high-fat diet.
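The gene set enrichment step mentioned above usually reduces to a hypergeometric test: with N annotated genes in the background, K of them in a given pathway, and n differentially expressed genes of which k fall in that pathway, the p-value is the probability of seeing k or more pathway genes by chance. A minimal standard-library sketch — the numbers are invented for illustration:

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) where X ~ Hypergeometric(N, K, n):
    N background genes, K in the pathway, n differentially
    expressed, k of those landing in the pathway."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Hypothetical example: 5000 background genes, 100 in glycolysis,
# 200 differentially expressed, 12 of them glycolytic
# (expected by chance: 200 * 100 / 5000 = 4).
p = hypergeom_enrichment_p(5000, 100, 200, 12)
print(p)  # a small p-value suggests enrichment
```

In practice you would run this test per pathway and correct the p-values for multiple testing across all pathways tested.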
And we see it at the pathway level as well. So through this kind of plot we're starting to see that while there are no differences at the taxonomic level, there are differences at the level of transcripts, enzymes and pathways — significant functional differences between these two sample types. Then we can use DESeq2 or ALDEx2, these differential gene expression tools, to identify genes that we can put into gene set enrichment analyses. We can perform hypergeometric tests, and as I mentioned earlier, comparing Plin2 and wild-type mice under a high-fat diet, you can identify 22 KEGG pathways that are enriched in differentially expressed genes. Once you've identified those pathways — glycolysis, gluconeogenesis — you can go back to your visualization and see what it actually means. And I kind of like this as a way of convincing you, I hope, that we can actually get something meaningful out of these kinds of studies — that it isn't all just noise. What was nice from this study is that you see this large set of consecutive enzymes — enzymes performing consecutive reactions in the glycolysis pathway — all down-regulated in the Plin2 mouse relative to the wild type on the high-fat diet. So this appears to be a coordinated down-regulation of these enzymes in the Plin2 mouse, and we think this is potentially due to an accumulation of triglycerides: if you have a lot of triglycerides, energy isn't such an issue, so you start changing flux within your energy production pathways towards producing biomass instead — you've got sufficient energy, so you produce more biomass and can grow quicker. And I think that's it, so I'm happy to take any final comments or questions. Who's interested in doing metatranscriptomics now? Yes — awesome.

Yes — or maybe one idea would be to use microarrays instead?

Yeah, I think with microarrays you wouldn't necessarily care, because if
you design the chip carefully enough, hybridization should only occur with the transcripts from your pathogen, relative to all the other transcripts that are in there. So that would be one advantage of microarrays compared with RNA-seq. But presumably you know what the pathogen is, so it's relatively trivial to sequence it now and to construct a microarray on the basis of that. Yeah, exactly.

I read a paper on a bacterium where they actually immunoprecipitated the RNA right after the gut, and after that they used only that platform — but it was very risky, that's what I heard. Another option would be target capture: you synthesize all these probe sequences to fish the transcripts out. Or single-cell genomics — that's very trendy at the moment.

All right, so I'm going to — yep — thank you very much.

There's also flow cytometry: if you make an antibody against it, you can literally fish the bug out of the stool. The people who run the flow cytometry facility may not like you much, but it's feasible. And there are other ways that can be done as well. All right.