 Alright, so just to introduce myself, my name's John Parkinson. I'm a scientist at Stick Kids Research Institute here in Toronto. What I'm going to be talking about today is metatranscriptomics. So just to get a bit of feedback, who here has done any metatranscriptomics? I'm not seeing any hands raised. Okay, who's heard of metatranscriptomics and who's interested in using metatranscriptomics? Okay, okay, good. Alright, so yeah, metatranscriptomics, I guess, it's been I think about six years since we published a paper on metatranscriptomics. This still doesn't seem to be that much that gets published on metatranscriptomics. So it still seems to be something that people aren't really adopting at the moment. So I'm hoping by the end of this kind of this lecture, this tutorial, it will give you maybe a bit of a confidence and idea as to why you might want to think about pushing metatranscriptomics more in your own research. Alright, so let's just go over the learning modules. So the idea is that I think at the end of this module we want you to have a good understanding of the capabilities of metatranscriptomics, what it can do. I want to outline some of the issues and challenges associated with metatranscriptomics concerning things like sample collection, experimental design, and then we're going to go through the various steps in terms of processing the data. And it's relatively straightforward. It's very similar to processing metagenomics data with a few kind of caveats, but we want to walk you through the pipeline and that's what the tutorial is really all about. And so by the end of it you'll be able to, and the tutorial will take you through the various steps of processing a metatranscriptomic data set. Alright, so why metatranscriptomics? So 16S Ribes-Omelin RNA surveys, they're great for telling us who is actually there, but they don't really give us much in the way of mechanistic understanding. So this is a study from a colleague in Colorado that was published in 2008. And this is a breakdown of taxa from a normal gut, from an IBD gut, and you can see that there's a difference in terms of the taxa that are there at different points in the gut. I don't really have an idea of what it means, so you don't know if that's actually a causal relationship or whether there's actually any functional consequence associated with that. So subsequently people have been turning to metatranomics, so metatranomics is where you do all the DNA in a sample, you just don't use just a microgene, so we covered a couple of lectures on that. And here you're getting an idea of what can the microbiome actually do. Okay, so I like this study, this is from 2012, this is from the Human Microbiome Consortium, there's about 120 different individuals, they sample from six different sites, and then they perform a taxonomic breakdown, which is the top one, and then a functional breakdown in terms of metabolic pathways based on metagenomic data. And what I want you to see is across each of these body sites, each individual seems to vary very dramatically in terms of the different microbes that are actually there. But when you look at the actual functions, the sites are very specific, so in a way it almost doesn't matter what bacteria you have there if they're all giving you the same types of functions. And they're moving into metatranscriptomics, so this is kind of an excerpt from metagenomics, if you like, metagenomics is giving you the functional potential, but metatranscript is actually telling you who is active, who is doing what within your particular microbiome. All right, so we iterate that metatranscriptomics really focuses on community activity, and the basic idea is we've all heard, who hasn't heard of RNA-seq? All right, good. So metatranscriptomics is basically RNA-seq, but you're applying it to microbiome rather than an individual. And the idea is you're getting these kinds of readouts where you can map these kind of gene expression profiles onto these kind of functional modules. So this is a set of genes that are involved in cell wall biogenesis and then the shapes of these nodes, the size of these nodes, indicates the relative expression, so the larger nodes indicate that there's a lot of that gene being particularly expressed, by placing it in these kind of networks, if you like, these functional module networks, you get an idea of what are the particular functions that have been up or down regulated in a particular sample. In move, next step, this is the same kind of graph view, but here we're now representing each of these genes by a pie chart, and these pie charts are now giving you a breakdown of what are the taxa that are responsible for expressing each of these different genes. So you can really, these kinds of tools, these kinds of visualizations can really give you an idea of drilling down into your data, understanding what are the pathway differences between samples and potentially what are the taxa that are contributing to those pathway differences. So this is really kind of the output that Metatranscriptomics is able to give you. So just to illustrate the capabilities of Metatranscriptomics, I want to just have a short vignette on a study that we published last year, which was focused on a model of obesity. So this is peri-lipin 2. It's a gene, a mammalian gene that's involved in fat absorption. So it's been found that when you do the plin 2 in mice, and you feed those mice a high-fat diet, the mice seem to survive that high-fat diet, so they don't have any of the negative consequences associated with the high-fat diet. And the idea is that peri-lipin 2, it's this protein that interacts with lipid droplets and enables a lot of fat uptake within the intestine. And so our question was, well, if you're going to have this plin 2 knockout, you're not absorbing as much fat into the host, so what is the consequence on the microbiome? So this was our experimental setup. We've got 2 gene in the types of mice, we've got wild type, we've got a plin 2 knockout, feeding mice a high-fat diet, low-fat diet, and we're using 4 mice in each sample. So this gives us a number of replicates that can help boost the significance of some of our findings. And I'll touch briefly on kind of sample numbers a little bit. So for each of these mice, we take sequel content, we then generate about 20 to 30 million sequence reads using RNA-seq from each of these mice, and then we push them through our pipeline. So the first thing we did is to look at what is the taxonomic abundance associated with these four different types of mice. And what we find is that while there are differences between the high-fat diet and the low-fat diet in terms of different taxa that are differentiated expressed, when you look at the diets themselves and the two genotypes on the same diet, we found no significant differences in the taxonomic breakdown. So the genotype doesn't seem to be having any impact on the types of bacteria that are there. This is kind of interesting because when we look at the actual transcripts and the transcripts expression, we do actually find that there are large differences in the transcripts that are actually being expressed. So you've got the same community across these two genotypes, but they actually seem to be expressing different functions under a high-fat diet. So this slide a little bit complex, but basically we're able to identify 200,000 bacterial transcripts across our samples, and this gives us a breakdown across the four types of samples, and we see that there's a core set of 78% or so of the transcripts seem to be expressed across all four samples. The more interesting graph is this one on the right-hand side, where we're just focusing on the highly expressed transcripts of those that seem to be in high abundance, and we see that each of these sample types has a significant population of transcripts that appear to be highly expressed that aren't highly expressed in the other sample types. So this is almost the reverse of what we saw in that previous paper from the Metagenomics Consortium, where you had different communities expressing the same function or encoding the same metabolic functions. Here we've got the same communities, but they're actually expressing different functions, and so this is why we think it's important to pursue metatranscriptomics because this is starting to give you some insight that just looking at taxonomic abundances just isn't going to provide. Any questions on this so far? Okay, so when we start looking at what these transcripts are involved in, we can map them into these metabolic pathways, we can do gene-set enrichment analyses, and we can find that out of 180 different metabolic pathways, as defined by the KEG metabolic pathway database, around about 42 of them seem to be enriched in these differentially expressed transcripts in the high-fat diet for the two genotypes. So the two genotypes under a high-fat diet seem to be changing the expression dramatically of a lot of their metabolic pathways. Where we're focusing on one of these pathways, so here we have glycolysis. The light red nodes indicate enzymes that are up-regulated in the plin-2 mice under the high-fat diet. These darker nodes indicate enzymes that are down-regulated by the plin-2 mutant under a high-fat diet. I hope you can see that there's this large chunk of this pathway here, and it's a kind of linear section which takes you from fructose, 6-phosphate, all the way down to pyruvate. This is all down-regulated in the bacteria associated with the microbiome from the plin-2 knockout. This is kind of a key part of glycolysis because this involves the reduction of a lot of energy, a lot of ATP, so it seems to be under a high-fat diet, the plin-2 knockout, the microbiome is actually down-regulating its ability or its functions to produce a lot of energy. And so this gives rise to a model such as this where in the wild-type mice where we have plin-2, the plin-2 is involved in lipid droplet formation, it means that the high-fat, the triglycerides are being absorbed by the host, which means you don't have so much triglycerides available for the microbiome, you just have sufficient to produce energy. On the other hand, under a plin-2 knockout, you don't have this plin-2 protein, you don't get the same lipid droplets being formed, you have an abundance of these triglycerides. This abundance of triglycerides means that in addition to having sufficient triglycerides in order to produce energy, you now start up-regulating some of these other pathways that are actually resulting in biomass generation, so formation of amino acids and all the components that a bacteria need to grow. So I just wanted to show this to illustrate how you can use metatranscriptomics to really start drilling down into potentially some of the mechanisms and generating hypotheses which you can now go back and actually test whether something like this is really happening or not. Okay, any any questions? All right, so how do we actually go about doing metatranscriptomics? So it's as I mentioned, it's basically RNA-seq, but we apply it to the microbiome rather than individual tissue. So here we have a mouse, we extract some of its gut contents, we extract the RNA from those gut contents, and this RNA is basically a set of transcripts from all these different species. These transcripts are then fragmented, you're sequenced and now you have the task of aligning all of these reads to known transcripts and then that gives you a relative expression and kind of a digital readout, if you like, of each of the different bacterial transcripts that are in your particular sample. So as I mentioned, RNA-seq is normally applied to a single organism where you have a genome sequence and you have a reference genome that you can map to. So one of the challenges that we're facing when we're doing metatranscriptomics of microbiomes is that we don't have a very good set of reference genomes and so doing this mapping step, doing this annotation, trying to understand what each of these reads belongs to, that's really the major challenge that we face. So just to reiterate that, in a typical RNA-seq experiment, normally you apply to, you carry it? Yes. Sorry to interrupt. No. So with respect to the energy-producing pathways, how is it known that those are energy-producing pathways? Has that been, is there some kind of reference? So I understand there's a keg database, but then has there actually been evidence of these being energy-producing pathways? In fact, there's a very basic question. So yeah, yeah, yeah, so that's, so no, no great question. So that part of glycolysis is a bit that's generating a lot of ATP. So that's by chemistry being worked out over decades where they can actually calculate how much glucose will lead to how much ATP and that's one of the major pathways by which you generate ATP in energy metabolism. And then with respect to the RNA-seq data, how do you establish the thresholds that are deemed to be significant in terms of op-regulation and then regulation? Is it a different threshold that one looks at with the microbiome as opposed to say human data? So, so yes, that's a very relevant point. So we go from identifying which these pathways significantly differentially express and we use a gene-set enrichment analysis there. So we take all of the transcripts and we compare them between the two samples. So in this case, it would be the plin 2 knockout versus wild type. And we use a program called the E-seq which is a program that predicts differential expression between transcripts across two different data sets. It's been widely applied to RNA-seq data and it actually works pretty well for microbiome data sets as well. So that will give you a list of those transcripts which are significantly differentially expressed. We use those in these gene-set enrichment analyses to identify which pathways seem to be enriched in these significant transcripts, differentially expressed transcripts. We then select pathways from this that we know there's glycolysis. So that was one of the ones that was significant and all we're doing here we're not actually now going back and looking at the significant transcripts. We're just looking at fold changes between the two samples. So this is just to look at what is actually happening within this pathway. Is there a consistent pattern up or down? So there are statistical methods that can pick out these kinds of pathways where we haven't applied them here. Here we're just looking at simple fold change. So but you can also probably break it down. We did and it does get quite complex and we've created those kind of plots and the ways that you visualize these data because you're starting to get into kind of multi-dimensional data sets that you want to understand what is the contribution from each taxon to each of these enzymes starts getting a bit tricky especially with these levels of coverage where sometimes we can miss enzymes and you get incomplete enzymes and so forth. So at the moment we're currently kind of aggregating together and yes again another challenge that we're facing is the consistency of pathway expression across different taxonomic groups. Absolutely. Yes, which one? This one here at the bottom? It's RPKM, yes. So the RPKM here is just referring to one sample so there's no fold change it's not relative to anything. The RPKM is just our normalized measure of expression and then we're using the RPKM values in this graph to actually come up with the fold difference so the the log fold difference is between the RPKM values. So I think we chuck out anything which has an RPKM less than five when we do these analyses. So again that's to do with DEC and you can set the thresholds as to what you kick out or not. So generally I think about five to ten is the level that people generally look at when they're examining the RPKM cutoff for expression. Okay, good, great. Okay, so just to remind ourselves, so in an RNA-seq experiment typically applied more in, it's simply been applied more in new karyotes rather than bacteria and the idea of RNA-seq is to get an idea of differential transcripts or alternative transcripts that might be expressed helps in genome annotations so you can identify where all the genes actually are. And to a lesser extent they're actually using it to look at differential expression across different tissues that's coming more into the fore and we're also seeing more and more of these applications in bacteria as well. However for microbiome samples we got a number of problems. So first of all because most of the RNA-seq is developed either on individual bacteria in isolation or on new karyotes. New karyotes have this very nice poly-A tail that you can select for. Okay, so one of the issues we find with RNA-seq and reply to bacteria is that these ribosomal RNA molecules are in really high abundance. They can represent about 95% of your dataset. Sequencing a 16s gene say several million times isn't really that informative and so you want to have some way of getting rid of all those ribosomal RNA genes. The second challenge we face is that these environmental microbiome samples that we're sequencing from we don't have a complete set of reference genomes and so mapping and identifying what each of these reads actually is what it's coming from is, as I mentioned previously, quite a significant challenge. Okay, so starting at I guess the beginning of the metatranscriptome pipeline sample collection, RNA extraction. So the first thing we have to bear in mind is that RNA, unlike DNA, tends to be a little unstable. So it can deteriorate pretty rapidly and so depending on the method that you use for storing the RNA that can have quite an impact on the types of transcripts and the kind of taxa that you can recover. So our suggestion is to process your sample immediately to extract the RNA and then you can stick it in the freezer. The next best potentially is to stack freeze your sample in liquid nitrogen and store it at minus 80 until you have a chance to actually do the RNA extraction. But if you can extract the RNA as quickly as possible that's what we find best. There was some talk earlier about use of RNA later. This seems to be a topic that seems to come and go in there and never seems to be much of a definitive answer but talking with colleagues in Colorado, they're kind of convinced that what happens with the RNA later is that it can selectively lyse some cells and it can also interfere with RNA extraction kits. And so we tend not to use RNA later in any of our samples despite the fact that it does seem to be very good at preserving RNA. We think that it's just the biases that it produces that we just kind of avoid its use. Another question that comes up is number of biological replicates. Fiona mentioned over the weekend that there are papers that she's now seeing that have been submitted to journals that are getting rejected if they don't have at least three replicates. So it's a relatively new field. The bar is pretty low. I would say at least two. Probably now we're moving to at least three. In our mouse model we did four. We recently did a large chicken study. We're looking at the impact of antibiotics and diatom chickens and there we've used five chickens per particular sample. So this seems becoming more of an issue but at the same time I would also suggest that it really depends on the question that you want to ask as well because you could just do one sample just as a discovery phase to see what is actually being expressed in your sample and then maybe identify some of the genes, some of the marker genes within that sample and go back and actually test those in a much more cross effective way. That's the biggest drawback to medical transcriptomics. It's inexpensive. So it's around about three or four hundred dollars a sample relative to 16s samples which I think Morgan's facility is around about 20-25 dollars a sample. So it's a significant order of magnitude more expensive and so that's really why people are kind of edging on the edging on the number of biological samples they can get away with. So going back to this idea of these kits and these sampling approaches I wanted to just mention this study that was published last year which was looking at different standardization different kits and the influence of different kits you get in this case this was just DNA. How many of you came across this this paper this this study? Okay so this was a contribution from a think about 10 or 20 different labs from around the world when they say okay we need some kind of standardization as to how we're doing microbiome otherwise we're not going to be able to compare from one study to another let's look at what all these different approaches how they influence the kinds of results that we're getting at. So in their particular pipeline you go from specimen all the way down to extraction storage, live preparation sequencing. They looked at each of these steps and saw and tried to identify which of these steps can actually influence your actual results and they find that pretty much every single step here is going to dramatically impact what you actually find. So there's a real need for better standardization and understanding what are going to be the best approaches the best standard approaches for analyzing these data sets. So these two graphs on the right here are looking at DNA extraction kits and there's 21 different kits. This top graph is substantive DNA which is relatively short so you see that that kit that kit and that kit produces relatively short fragments of DNA so they're kind of discarded and then this bottom one gives you how much DNA are you able to extract and then these kits 3, 4, 10, 12 whatever they're able to filter out these kits because they're just not very good at extracting very much DNA relative to some of these other kits. This is a breakdown of taxa that these different kits were able to recapitulate as well and here we have ground positives here we have ground negatives these are replicates so sample one sample two the replicates are pretty consistent so each kit if you give it if you give it replicates seems to give consistent results. But you can see that this kit here these two kits here are very good at extracting DNA for ground positives this kit here very good for ground negatives not so much for ground positives. So depending on even the DNA extraction protocol using that can really bias what you're actually able to recover so I just wanted to mention the need for this kind of standardization and thinking about what is the actual kits that you're using what are the approaches you're using and try and ensure that you're doing consistent approaches to what the rest of the field is looking at. Okay so back to RNAseq so I mentioned that bacterial messenger RNAs we don't have a poly A tail so we can't use that to pull down and select for messenger RNAs which means that we have all these very abundant ribosome RNAs. So fortunately there are kits available to do this so this is the one we use ribo zero gold kit this is their data so when you don't deplete you get 72 percent ribosomal RNA only 4 percent of messenger RNA when you deplete with ribo zero you can see you get much more messenger RNA relative to ribosomal RNA. So it's a pretty good kit however it does have biases so it's not very good at getting rid of ribosomal RNA from actinobacteria okay so again you're getting these biases that are coming in depending on how you're handling and treating your samples. However this this this is really kind of a gold standard for what we do for minimizing contamination from ribosomal RNAs. So I think when we first started six years ago there was a ribo minus kit not a ribo zero kit the ribo minus kit would only eliminate about 30 to 40 percent of the ribosomal RNAs now these kits are eliminating around about 95 99 percent so they're doing really a very good job of removing ribosomal RNA. Again now another cost kind of consideration is how many reads do you actually need to generate in order to say you've got a good understanding of the functional capacity or the functional expression within our particular data set. So this is a rarefaction plot four different types of microbiomes a DC, kimchi, mouse, cow, rumin and we're just looking at the number of enzymes enzyme classification numbers so this is kind of the enzyme capabilities of each of these samples and we find that around about five million reads or so is sufficient to kind of saturate what you're going to discover in terms of the enzymes that have been expressed within your sample. So we're using that as a kind of a guide for a minimum number of reads that you need to generate messenger RNA reads that you need to generate in order to get an understanding of the functional capacity functional capabilities of your particular sample. So this means that rather than using technology structures for 4.5.4 sequencing or MySeq which are very good platforms for generating quite long reads or packed bio now what we need is is is quantity we need millions of reads that we can map and get a good kind of dynamic range of each of the genes and how much each of those genes is actually being expressed. So with kits so with the kits like the Ryber Zero kits we're getting 25 we're now up to maybe 50 percent of our reads we can actually map to a messenger RNA so around about 20 million or so reads per sample is what we're kind of suggesting. Okay so you've gone through your sample preparation you've extracted the RNA you've made your kits you've done the sequencing so this is where the hard work now comes in it's actually doing the analysis and it's processing all those sequence reads to actually make something informative out of it. So this is kind of my lab spread and butter we develop these kinds of pipeline sets take these millions of reads and pass them through the pipeline and try to make some sense out of them and a lot of these pipelines and there's that there's an increasing number that has started to get published again very similar to metagenomics analysis pipelines at least at the beginning of the of the process so at the beginning of the process you have these kinds of initial processing steps you want to remove low quality reads you need to remove the adapter sequences there's host contamination and then in the case of metatranscriptomics there's a ribosomal RNA that you need to identify which may have not have been removed by the kits. Okay so those are fairly standard those are little tools that you can plug in and you stitch them together within your python or perscript or whatever and you come out with what you hope is a set of reads of putative messenger RNA origin and now this is where things get a bit fun because you then have to try and map those putative messenger RNA reads back to some kind of transcript which can give you some kind of information about what he actually came from. So the first thing we do is an assembly method we find that if we're using a machine like Illumina high-seq you're generating reads at about 100 125 base pairs in length we find that those are quite short and it can be tricky to actually annotate those so if you can the magic numbers around about 170 200 base pairs gives you a much greater ability to actually annotate them. So we do an assembly step to see if are there reads that are coming from the same transcript that we can piece together and we can use those to create a longer kind of contig if you like and then you can use those to get a better idea of what what the original transcripts it derived from. So there's this assembly step and then there's a whole set of annotation searches where we once we've done the assembly we pass each of those contigs each of the unassembled reads against a series of sequence similarity search pipelines to try and map all that information to know and to know in genes. So hopefully by the end of this step what you have is a set of known transcripts bacterial transcripts you'll have an expression amount associated with that transcript so you'll know how many reads and mapping to each of those transcripts. You now want to put those in some kind of functional context and so there's a whole bunch of tools that you can now plug in you can push your transcripts through something like an enzyme annotation pipeline to identify which of these transcripts are enzymes and then start piecing together the pathways of that. You can you can add all sorts of tools depending on what you're particularly interested in so we've recently started looking at antimicrobial resistance genes there's databases out there like the card database that you can again take your set of transcripts pass it against the card database and identify all those transcripts associated with different antimicrobial resistance mechanisms. Okay so these are kind of if you think of them as kind of tools that you could slot in yourself depending on what is a specific question one of the specific sets of genes that you're particularly interested in and then once you pass it through these tools you there's different kind of visualizations or statistical methods that you can then use to integrate those data sets to actually turn it into some kind of more rational informative set of data to tell us what is happening about our systems of interest across these different data sets. So for example in the case of the mouse samples we might look at enzymes and then those enzymes back into those pathways and then we can use some kind of visualization tool to actually look at what is actually happening at the end there. So in the tutorial we'll actually take you through each of these different steps to give you a bit of an understanding of what each of these will actually entail. The dataset that you're actually going to be using is relatively small I think it's only about 100,000 reads. We recently in our chicken study I think there's in the region of about three to five billion reads in that and so computer clusters are really key to this and I think it's taken us about five months to go through about 80 metatranscript homes. There's lots of organization piecing stuff together I think ideally we could probably do it in about two months but these are big big data sets that require a lot of processing a lot of computation a lot of organization in order to process from start to finish. It's cool data. So any questions on this so far? Okay all right so I mentioned there are other pipelines one called Sansa. This one doesn't do an assembly step it relies on this MG RAST tool which has anybody used MG RAST? And what do you think of it? Well that is tricky. What did you think of the quality of the data that you got out at the end? Okay so there does seem to be a bit of a backlash against MG RAST it's been around for I guess quite a number of years and people are very happy with the kind of the quality of the annotations that are coming out of that so it's a nice plug and play kind of tool that you can just upload your data set and it gives you a result at the end. Whether that's good thing or not is really up for debate because you don't really know what's happening behind the scenes some of the quality of the annotations get can be a bit dodgy as well. So you're much better off if you if you really wanting to dig into specific mechanisms and properties that you're interested in doing those kind of annotation things yourself and that that would be the suggestion yeah. Are there any gene expression arrays for that? So I think I'll probably address that in a couple of slides time but not really. Just the sheer level of diversity kind of precludes that. All right so first up in the process filtering the reads so Trimimatic. Has anybody used Trimimatic? Yeah well good some people actually heard excellent. So Trimimatic fairly standard tool does a very good job uses this sliding window approach to remove low quality sequence data. There's lots of different methods for doing that. Again these pipelines you can think of them as works in process you have your kind of your Python script that's running through each of these steps and then you can add in the tool as they get improved over time. For filtering the Ribosome and RNA this is this is our our current curse. So we rely on this tool called Infernal. Now a lot of people have you see they publish their methods is using this sort me RNA and sort me RNA is really good because it's fast but it's really not that good because it misses about 50 percent of the Ribosome and RNAs and so that's why we go back to Infernal because Infernal is a much more sensitive method for actually detecting Ribosome and RNAs within your particular sequence. The downside is that it's 60 slow and it's probably the slowest part of our pipeline but we also have to start thinking about as the kits get better at removing the Ribosome and RNA maybe this Infernal step is one that we don't even need to think about worrying about too much in the future. Okay so I mentioned that we need to do read assembly this graph here is just to show you that as you get into this kind of range 180 200 base pairs your ability to annotate sequences really gets very good so doing the assembly step we think is important because it gets you into this kind of size range where with the context you can actually identify what the source of that content can come from. A number of methods for doing assembly previously we've used Trinity with now starting using spades which seems to be doing a pretty reasonable job of assemblies. One of the problems that can occur within assemblies are chimeras but in practice we find that these chimeric sequences really again seem to be quite minimal relative to mostly assembled reads so we don't normally worry too much about these chimeras. I should mention that just as an anecdote we were using Trinity on one of our first data sets we came out with one of the top hits in one of our samples and we were looking at it and I thought wow this is great this is huge assembly it's about 45 kilobases that Trinity was able to assemble just from these messenger RNA reads and we start looking at everything oh this is interesting it's this phage genome and it's from this deep sea and we're thinking oh maybe there's a lot of transfer between these different organisms within this deep sea environment and this phage is doing that looking a bit further it turns out to be Phi X and Phi X is the spike in that you use during the sequencing so that was a little disappointing but on the other hand it shows what a really good job these assembly tools can actually do when you're combining all of these reads from a messed up kind of genome from phage and the assembly tools can actually recover that irrespective of all these other reads that are in there so these assembly programs really do a phenomenal job okay so ideally what we've done is we've got a good set of messenger RNA reads we've done our assembly so we have set of reads a set of contigs okay so our assembled reads or unassembled reads we've mapped them to these full-length transcripts what we need to do now is we need to do functional annotations of these transcripts so we need to first map these reads onto the transcripts so how do we go about doing that so we can use methods like BWI and BLAT so we can download all of the genomes that we have available from the NCBI and there's about 7,800 of them at the moment there's a whole bunch of scaffolds and unassembled and not completely assembled stuff that adds up to about enough 20,000 genomes or so and you can try these tools BWI and BLAT and try and match your sequences to each of these genomes but these tools really rely on almost perfect matches and the problem is is that there's a lot of diversity out there so this is a study from I think 2007 or so Streptococcus acylacti and what they did is they did this rarefaction analysis to see every time you sequence a new genome of Streptococcus acylacti how many new genes do you get and so the idea is that when you sequence a new strain there's always going to be a whole bunch of new genes that you haven't seen before so that's going to lead to genes that you're not going to be able to map to because you've just got one reference strain or maybe 10 reference strains you don't have all the reference strains around there the other problem we find is that the diversity at the nucleotide level because of this third base pair wobble you can have the same protein sequence being generated by different strains but then because the nucleotides are varying at that third base pair I mean because it just encodes the same protein that's going to vary from strain to strain as well and so that strain diversity associated with both the nucleotide diversity associated with these new genes that you're just not capturing this is why you're simply not able to use these kind of microarrays specifically for bacteria because they're not able to capture that huge amount of diversity that we're going to see from sample to sample so instead we can use a tool like blast so this is a typical magical 71 base pair read and you can see a problem here the e value is not e to the minus 39 it's 39 and for those of you who've used blast that's the kind of pick that you throw away however when you look at the actual match it doesn't look too bad when you look at the species is coming from it's seems like sure that's probably where it's coming from so rather than looking at e-values and this e-value is a little bit ridiculous because it's a relatively short read we can start thinking about using things like percentage ID or percent of read length as indicators or cut-offs to say what we think is a good read is a good match so in a way it's a less sensitive version or a less precise version of BWA but you're still trying to match as precisely as possible to know in transcripts okay so what we end up doing is we use this tool called diamond and diamond's been a real saviour because blast is again quite slow it's one of the steps that really slows us down in processing these datasets diamond speeds things up by about a hundred to a thousand fold so this is really made processing a lot easier and research against protein databases and this enables us to come up with pretty good ability to annotate so these are eight different mouse samples large chunk this is the breakdown of what you find in these samples so a large chunk here is adapter or low quality so for some reason we joined the library preparation and sequencing with a large chunk of low quality sequences adapter sequences so it's good to have them move those the ribosomal RNA is this blue bar here and that's actually pretty low so these again just to emphasize these kits are doing quite a good job of really reducing the amount of ribosomal RNA that we're finding in our samples and then these green bars here these are our putative mRNAs this dark green bar are annotated reads so we're actually doing a pretty good job now of being able to match our reads to known genes and then this small section here it's about 10 15% these are the reads that we simply couldn't annotate through anything for one reason or other yeah so these sorry this is a microbiome from mouse it's not so the host that there's that there was really a minor amount of host that was in there so what we do is we take the set of mouse transcripts and we use that in a filtering step and we do by bwa against that set of mouse transcripts as a way of identifying and removing those host sequences the red one uh yes interesting it's not listed here is it that's that's a secret DNA no I have no idea it's potentially hosting it's been miscolored okay so again as we're getting more of these uh genomes again in sequence we're getting a better and better ability directly doing some kind of annotation now one thing I should say is going back to this kind of blast now the reason we kind of feel we can get away with this is because we don't actually care about the specific species at this point we're more about the function and so it doesn't matter if it's matching to one bacteria or another bacteria at this point we just want to get at what is the function associated with these transcripts so in order to get at the taxonomy and identifying which kind of tax on our responsible we have a whole different set of methods for annotating um taxonomic information to each of these reads so we now have these transcripts that we've identified that these reads map to and we need to do a normalization step for expression so returning the expression levels into these reads per kilobase of transcript max rpkm so rpkm is a normalized level of expression and the main idea behind it is that you could have genes of different lengths and you can have reads mapping to these different lengths of genes and you might expect that this gene is equally expressed when it's this gene but because they're different lengths you're going to be sampling much more from this one than you are from this particular gene if they're present in the population so what the rpkm does it normalizes on the kind of gene length and so you end up with an rpkm value which shows you that this gene is actually much more highly expressed than that gene okay so that's that's really the main idea as to why we want to do this this kind of normalization uh so that's great we've identified a set of full length transcripts we have a whole list of transcripts we now have the rpkm the relative expression of each of these transcripts what we want to do now is to actually identify functions associated with these transcripts and so this is where you can plug in whatever kind of functional annotation tool you want so you could do something very straightforward and this is something that impresses me a little bit you can just map it into gene ontologies so these like biological processes and you end up with these meaningless pie charts that show you whether there's nucleotide binding or amino acid biosynthesis or something like that not very informative all right but a lot of people like doing them so sure go nuts but um you can be a bit more specific and start focusing on some systems that are more relevant to yourself so we've had a long-standing interest in enzymes we like enzymes metabolism because it's one of the best characterized systems that's out there so it's been studied for over a hundred hundred fifty years or so we know an awful lot about metabolism it becomes quite easy and and you feel quite confident about the results you get when you're mapping and identifying enzymes from your transcripts so there's methods for identifying enzymes from your set of transcripts recently we've been interested especially in that chicken study looking at antimicrobial resistance so we're using the card database and we're going to plug the card database in there's another annotation pipeline to see which transcripts can be annotated to which type of different antimicrobial resistance mechanisms uh another area that we sit in is biofilms sorry have a set of genes a set of gene families that we know are associated with biofilms again we can use that set to part our transcripts again to identify biofilm machinery within our particular data sets so again you can plug in whatever kind of subsystem bacterial subsystem you want and increase the accuracy of your annotations and your functional annotations that way so the one i want to focus in on is enzymes and that's because we do a lot of it and because uh we think that some of the tools out there aren't doing it quite perhaps as as well as they should so uh when you look at a lot of these pipelines so human too is a little guilty of this as well they rely on blast a large extent to blast so they're blasting a transcript or a gene against a set of swiss port enzymes and from our own analyses we know that that results in about a 50 false positive rate so blast is a really bad tool for annotating enzymes and the problem is is the amount of diversity of different enzymes um across different enzyme classes so at one extreme you have some like phosphate glycerate kinase and these are the alignment scores where we take in all the different examples of phosphate glycerate kinase and we blasted them against each other and we see how well do they discriminate from everything else and these are hits to non-kinase and you can see that the alignment score is very good so you know if you hit one of these genes it really is likely to be the phosphate glycerate kinase and that's what these kinds of hard charts these are positive hits these are negative hits there's no overlap if you hit one of these enzymes you know that's pretty good you're gonna hit it at the other extreme you have serine 3-line kinases you see there's a huge overlap between the positive and the negative hits so if you hit one of these enzymes you really don't know if it really is an enzyme or something else that isn't an enzyme and then you have a whole set of stuff in between where depending on the hit you have a certain probability associated as to whether that really is the enzyme or not so this is um suggesting that we need better methods more sensitive cleverer methods to actually make use of this and recognize that different enzyme classes are easier to detect than others so just as a plug this is a our own piece of software called detect this is a precision recall kind of graph this one here is ncp that's quite widely used i didn't put blasted blasted kind of somewhere uh this blue one pream that's that's pretty pretty useful not bad and then these are the detects two versions of detect we came out of a new one recently and then up in x we're starting to use these specific enzyme class specific cutoffs for predicting where an enzyme should be associated with that particular read or not so we think that our tool is doing pretty good the problem is it doesn't cover all enzyme classes because we in order to create these plots you need a certain number of enzymes to create a reasonable kind of distribution and we don't have it for all enzyme classes so in practice what we do this is the overlap of enzymes that are detectable by these different methods uh there's detect up there you see it misses 1200 enzymes that some of the other methods do and so in practice we use detect predictions and then we use an intersection of pream and blast so we don't trust either by themselves but we do trust their intersection the combination of their predictions if they both predict the same thing then we can be reasonably confident that they are predicting the correct enzyme okay so that's again just my little pet peeve that's when you're looking at enzyme annotations just be aware that some of the methods that are being adopted particularly if they're blast you can get a lot of false positives out uh so as I mentioned using something like blast enables you to get to the transcript but we're not using it for taxonomic information we think that there are more sensitive ways for inferring taxonomy of a sequence read and by blast and there's a number of methods that have come out from that so this is why we want to do that so I guess I kind of undersold the idea because we find that different taxa can give the same functions but we might actually be interested in knowing what is the specific taxon that is associated with an express gene so there might be an interesting pathway that's been expressed is there one specific taxon that's responsible for providing that so that's why we think going back and actually annotating the reason the taxa is going to be important so how do we go about assigning taxonomic information um so there's a number of tools out there a lot of them rely on e-reliant methods so BWA is quite good at mapping precisely to a genome but they can fail when you don't have a complete set of reference genomes uh other methods are based on compositional methods such as nucleotide frequency code on biases um so here we can split a sequence into sets of 3MOS and then you get a distribution of the frequency of each of these 3MOS within your particular sequence and then you can map that frequency profile to frequency profiles associated with a whole bunch of different genomes and that's what a lot of these kind of CAIMA-based approaches do they build these profiles and then they try and find a genome that has a similar profile okay similar kind of 3MOS profile the idea being that there's going to be things like similar codons that have been used between that read and between the genome that you're actually using uh so number of methods one of the best ones is actually this one called MBC which seems to do a really good job um at doing annotation but it uses really long cadence it doesn't use just 3MOS I think it uses something like 60MOS or something like that but it does a very gets very gets things right it gets some really right so it's very good clark and kraken maybe not quite so good um there's been a I'll just mention a couple of others there's one called kaiju um uh what to say about kaiju so kaiju is fast and and so a lot of people like it because it's fast it does have large memory requirements so you do if you've got a large data set that you're comparing against you do need a dedicated kind of server machine some friend down the hall who's got a big kind of server thing um but it's fast so there is that uh and then there's centrifuge that's so what one that came out I think 2016 or 17 again again it's fast so that's great um it's uh it has lower memory requirements because it has this I'm not going to go through this picture but it has this kind of quite clever way of being able to compress genome so if you find similar genomes similar sections of genome to another genome then you don't have to search against it so it does a very good way of pruning your reference data set to bring it down to a more manageable level one thing it does that is quite nice is that rather than saying that it's going to take the top it says well these four taxa here are probably equally likely so I'm just going to assign this read tool for these taxa and while that's possibly not intuitive at least they've been honest about what their results actually mean and that they're saying well I don't really know if it's bacteroids or if it's a clostridia or if it's something else okay and again it's fast which is great how well do these actually perform uh this analysis that Mobile Archie did last year uh compared diamonds the results of diamonds kaiju centrifuge from um for this mouse gut microbiome diamond and kaiju pretty similar uh that's probably not too surprising because both diamond and kaiju are working in peptide space so they're searching at the level of amino acids they're not searching they're not using nucleotide diversity to make their match centrifuge on your home on the other hand gives you quite a different distribution so it's kind of oh so which one really is correct is it going to be kaiju it's going to be centrifuge why is centrifuge giving you a different distribution you look at the number of reads that've been annotated diamond is annotated about any percent of the reads kaiju just over 50 cent centrifuge only 25 percent okay so maybe this difference is because centrifuge is just not able to annotate lots of stuff that kaiju and diamond can so we didn't know this when when you were seeing kaiju or diamond in the protein space actually some very close to you sometimes to equalize and close relative to equalize sometimes we may see some of them which is not that close to equalize still get pulled out by kaiju and diamond so just know the caution what you use when you search in protein space you know there are some very similar proteins yeah that's a great point and so we've done a I guess because we didn't actually know what the breakdown should be for these reads in terms of text so we actually got hold of a gold standard so uh sometime ago we sequenced uh mouse cuts and it's an asf um like who's sort of asf oh good so altered shedler flora uh so it's a kind of a standard microphone consistent about eight or 10 taxa that people used to inoculate uh mice germ free mice to actually have a defined microbiome that they know what's there I think uh two years ago they finally sequenced all of those 10 organisms and so we generated a meta transcriptome for that we know we can then map to those known genomes because this was done in the germ free mouse there's nothing else in there we can actually recapitulate a gold standard and so what we did is we compared a whole bunch of different methods as to how well they're predicting at different levels so as Will said because you can get um a when you work in peptide space you can get a lot of hits against different E. coli's maybe you can't annotate at these specific species levels so these different colors represent how are you doing at different levels so this blue here represents being able to annotate accurately at the level of bacterial order so you see clark and kraken are only able to annotate around about five percent of the reads from this sample which is a little bit suboptimal mbc is doing pretty well yeah up here this dark gray bar saying these are bacterial reads great here we have kaiju two two versions of kaiju and centrifuge yeah they're able to annotate around about 20 percent 25 percent of the reads which means you're throwing away 75 percent of your reads but they're fast so that's maybe why a lot of people are adopting them this is our own in-house tool there's a bit of a plug called gist um it's able to do pretty good at the order level but not not so well beyond so it's not kind of suggesting it's not saying I'm super great I can only uh I can annotate things that demonstrate it's kind of very honest and say well I can only suggest that it's able to identify things at the order level the problem is it's slow and people don't like slow so this is again a pet peeve, the current trend seems to be for tools that are fast irrespective of performance so this seems to be propagated by journal editors as well who seem to like publishing stuff that is fast because they think people are going to adopt it but I really would caution you against using these kind of fast methods unless you start looking at how well are they actually performing for your specific data set so when they publish their tools they run it against their pet data set and they may have trained it well for their data set and it might perform very well for their own data set but if you look how they perform in your own data set it might be doing a really poor job so just a word of caution uh the other thing to mention here is you could compare your messenger RNA levels to your ribosome and RNA levels so you get the 16 s that are filtering through that you chuck away you could use those as an indicator of what your taxonomic abundance should be looking like you could perform a 16 s profile at the same time as you're doing the meta transcriptomics run as well and you could use that kind of information to help bootstrap and identify what taxa should be present should be expressed in your samples but just a word of caution again uh 16 s isn't necessarily the same as uh meta transcriptomics so these are paired samples this is messenger right this is messenger RNA this is ribosome RNA the kind of profiles to get out aren't going to be exact there's going to be some taxa that might be in high abundance that aren't very active conversely there might be some taxa that are in very low abundance that are very active and are responsible for generating a lot of gene expression uh in terms of visualization um this is a tool you're going to be using the tutorial I think it's called crona um it's quite pretty I'm not sure that you'd ever use it in your actual paper but it's kind of fun because you can definitely see it zooms out and zooms in which you're kind of feeling for what kind of um what what kind of taxa um and relative abundance of those taxa in an interactive way uh in a kind of web browser that's kind of a cool tool okay so you've gone through a couple of months hopefully of processing your data sets and annotating these data sets and turning them into functions and taxonomy how do we go about exploring the end result how do we actually visualize the data to make some sense of it and so because these are very complex data sets you're looking at transcripts hundreds of thousands of transcripts we need some way of organizing them into something that makes sense to us and so this is where we think systems biology can come in so rather than focusing on broad functional categories such as the gene ontology categories or the cog categories these graphs where you see that 20 percent of you read to associate with nucleotide processing or whatever which kind of leave you a little bit lost you can start thinking about using some of these kind of systems used so we think of things in terms of functional modules so groups of genes that are working together to provide a consistent function and we think if you can identify sets of genes within these modules in the same module that are consistently being up or down regulated and that is a pretty good indication that that functional module as a whole is actually important or not important in your particular microbiome and there's an increasing number of these kind of functional modules appearing now for bacteria you can think of protein complexes organizing your genes into protein complexes metabolic pathways we've heard a lot about signaling pathways as well so there's these kind this kind of idea that starting about organizing your data in terms of something that makes sense to you in in in these kind of bacterial subsystems uh so there's uh several tools that enable you to place your data or we'll give you these outputs as part of their automatic pipelines so energy rest again we'll give you this kind of nice um seductive view if you like a pathway that means think wow this is great we've got all of this particular pathway here about kind of digging in and understanding that most of these enzymes might have been erroneously annotated uh but it looks cool and so you can have these kinds of network maps that kind of show you which enzymes actually present in your sample uh Megan's gone a little bit um higher and rather than having a single color you now have gradient of color where you can actually show the relative expression of each of these enzymes within the context of the pathway one of the problems I have with these kinds of maps is that these are based on keg maps keg maps are defined on the basis of three organisms I believe E. coli yeast human we're sequencing organisms that are not E. coli yeast or human and so these reference maps may not actually be that relevant in addition there might be connections between pathways that you're missing so this substrate here might be found here and so there might be an enzyme maybe interconverted between them and so there might be a chunk of enzymes which are moving along that kind of direction which may be upregulated but you won't see that because all of these pathways are kind of just visualized in isolation and not getting the global picture you're not seeing the interconnections okay so these kind of borders artificial borders that these kind of pathway views can be a little bit misleading so that's why we're kind of trying to advocate these kind of more network views so we in our lab we do a lot of these kind of metabolic reconstructions where you take a genome of an organism you identify all the enzymes and then you kind of reconstruct all of the pathways associated with that so each of these nodes here represents an enzyme and then we try and organize them into these particular kind of systems so the problem with this is that these are very complex it takes a lot of time to click and drag all of these nodes one by one and so there's real need for more standardized methods for kind of giving you these kind of global metabolic network views if you like they can enable you to look at some of the interactions across different pathways and in actual fact keg to be fair have actually come out with quite a nice plug-in where you can well so it's a so it's a tool called scientists who's heard of sign escape wow that's polarizing yes no wow so side escape is a really good visualization tool for placing genes for example in some kind of functional context so each gene might represent a node and then they're connected to other genes other nodes if there's some kind of functional relationship between them so that's basically what this view is here this is a side escape to each of these enzymes has been linked to other enzymes through substrates okay so it's kind of like a graph view if you like where you can start placing things together based on their functional relationships so side escape actually features a plug-in from keg where you can download the keg views and then you can add in your own data superimpose your own data to create these kind of quite nice views where rather than nodes just being colored according to whether they're present or absent you can now start thinking about taking the taxonomic information that you so carefully put together together with your enzyme information to create these kinds of pie charts and these pie charts can show you by their size the relative expression of the enzyme but also the taxonomic contributions associated with their n time as well so you can start seeing which tax were contributing to specific functions so these tools are pretty cool to use and it's the sort of thing that you just want to sit down on a Friday afternoon and upload one of these things and just play around and then think oh it's actually kind of interesting and it's a very intuitive tool and again as part of this tutorial hopefully you'll get a chance to use part of side escape and and generate these kinds of views beyond metabolism we can also do protein-protein interaction networks or protein complexes so this is this cell wall biogenesis kind of module so each of these represents an individual gene and then they're actually linked according to whether they're part of the same pathway or they have a physical interaction and so again you can use these as a kind of a scaffold to layer on your kind of metatranscriptomic data so the genes that they've been annotated to so the transcripts associated with cell wall biogenesis and then these pie charts again indicate the taxonomic contributions associated with each of those different functions and so you might find that these three genes here are pretty consistently expressed by actual readings okay so it starts giving you an idea of consistency of expression within certain taxonomic groups okay so just a couple more slides so I have a little bit of time so statistical considerations so metatranscriptomics again it's a relatively limited field at the moment there haven't been that many tools that dedicated tools that have been developed for analysis of metatranscriptomics so in terms of statistics how many biological replicates we're saying at least two for me at least three but we have to bear in mind that these are expensive experiments power analyses so there have been some papers out recently which are describing how to do a power analysis for 16s and I think even metagenomic data sets I haven't seen anything so far for metatranscriptomic data sets so that's all a little bit hand-wavy at the moment in terms of methods for identifying differentially expressed significance in the differentially expressed genes there's a tool called ALDEX2 which is very powerful but it isn't very sensitive so when you run ALDEX2 you don't actually get that many hits back because it's it seems to be very restrictive DC2 on the other hand which was developed for RNA-seq data sets specifically for RNA-seq data sets actually seems to perform pretty well for metatranscriptomics and there's I think Rob this afternoon is going to be talking about the benefits of things like ALDEX2 and DC2 being able to treat data as compositional data so we have to remember this is compositional data as opposed to giving you an absolute abundance and Rob will talk a lot more about that this afternoon and again I think most of the power of these analyses is if we can convert a transcripts into gene sets into these functional modules that gives us a much more reassuring view of consistency of changes in expression and we can use genes and enrichment analyses as a way of giving a significance kind of level of approval to whether these functional modules, pathways, protein complexes and so forth really are differentially expressed from one sample to another so ultimately everything in metatranscriptomics can be viewed as hypothesis generating so if you identify a transcript or a module that seems to be up-regulated in one mouse compared to another mouse maybe you want to go in designing your experiments and see in more sense of experiment whether that module really is up or down-regulated and you can get your statistical power that way but for the most part reviewing metatranscriptomics is really hypothesis generating. Just as for 16s and metagenomic data we can produce these kinds of principal component plots. This was just a fun set of plots from our original mouse study the plin2 study we find that there is no real significance in taxa except that they never diet but when we're looking at significantly differential expression transcripts we're looking at enzymes across samples we're looking at pathways then we're starting to see significant separation between the sample so again just to emphasize that you can have the same community but the actual functions that they're expressing can change dramatically just to mention again dc, hr, lx2 again these are pretty good for doing differential expression and then we can use them like hypergeometric assess, gene set enrichment analyses to identify groups of genes that have some kind of significance associated with their expression when you combine them all together and with that I think it must be way beyond time for a break and we will be moving on to the tutorial and the tutorial is one that we've been tweaking for the last two or three years it's a data set of 100,000 reads from one of our mouse meta transcriptomes and it will take you through the various steps of processing going right from the raw sequence data all the way down to these kind of side-escape views hopefully you can complete that in two hours hopefully you think side-escape's fantastic I should be using this even if I'm not going to do meta transcriptomics I should still be using side-escape as a way for visualizing a lot of my data anyway yeah