All right, so I'm John Parkinson. I'm from the Hospital for Sick Children, where I'm a senior scientist. I work on parasite genomes and, more recently, metagenomes and microbiomes. And what we've been focusing on most in terms of microbiomes is metatranscriptomics. So for those of you out there, how many people are actually generating metatranscriptomic data sets? Anyone? All right, three people. Superb. How many people are considering doing metatranscriptomics? That's very encouraging. And the rest of you, how many of you are wishing that there was an earlier flight or something? All right, so what I'm going to try and do today is to convince you why you should be considering metatranscriptomics as part of your microbiome experiment. Okay, so these are the learning objectives of this module. The idea is that at the end of this module, you will gain an appreciation of the opportunities, as well as the challenges, that are inherent in metatranscriptomics. Hopefully I'll do this by giving you some understanding of the capabilities, what metatranscriptomics can and can't do, and an appreciation of sample collection, so some of the caveats and important things that you need to take into account when you're doing the sample collection, as well as experimental design. But mostly, since this is what the workshop is about, we're going to be focusing on this: what are the important key steps for when you get your metatranscriptomic sequence data? What are the key steps when you're processing it? What are some of the problems and challenges that you can come across? And what can you expect to get from these data sets? Is it all promise, or do you actually get something meaningful out of these kinds of data sets? 
And then in the tutorial, what we've put together is a relatively simple metatranscriptomic data set, with a whole bunch of scripts and tools, to take you through the various steps that we go through in our lab to process metatranscriptomic data and come out with some kind of view. So this is what we're aiming for. All right, so here's an overview of the lecture. First of all, what is metatranscriptomics, and how does it relate to RNA-seq? Then, as I mentioned, experimental design, sample collection and preparation, and then the processing of the reads. So what are the filters? Assembly, and why do we assemble? Again, I'm not going to walk through the replication of this, that's the tutorial. And then getting into functional and potentially taxonomic annotation, and then, briefly, statistical analysis and visualization. Okay, so over the past two days, you've been hearing a lot about metagenomics. When we were starting to put this workshop together, all of the instructors were scratching our heads as to, well, what do we mean by metagenomics? Is metagenomics purely whole-shotgun DNA sequencing, or does it capture 16S as well? So there's still this kind of, I guess, taxonomy or semantics as to whether metagenomics really refers to 16S surveys, or whether it should be restricted to whole-genome DNA sequencing. Irrespective, the point of metagenomics, whether it's survey sequencing or whole DNA shotgun sequencing, is that you're trying to get an idea of the community makeup, either in terms of the species that are there, or in terms of what genes are there. Now in terms of 16S, although it can inform on the species that are present, it does have several drawbacks. It doesn't provide much in the way of functional context: it doesn't tell you what genes are important. And even with whole-genome DNA sequencing, you can see which genes are there, but it doesn't tell you which of these are actively expressed. 
In terms of the 16S surveys, you can identify the species, but even if you identify those species, we know that different strains can have completely different complements of genes. So, for example, certain species of Lactobacillus can vary by as many as 2,000 genes between strains. Knowing what species are there doesn't necessarily tell you what genes are there, or the activity of those genes within that sample. So relying on 16S sequencing can provide you with species-level information, but due to horizontal gene transfer, the gene complements, even across strains of the same species, can vary quite considerably. Conversely, rather than community makeup, metatranscriptomics really focuses on community activity. Here we're exploiting RNA-seq to determine which genes and which pathways are being actively expressed within your microbiome sample. So the idea is that metatranscriptomics is revealing the active functions. And it might be that you don't even care what taxa are responsible; you might just want to know what this microbiome is actually doing from a functional perspective, and I can show why that might be an issue later on. On the other hand, there are also a number of tools starting to come out which can reveal which taxa are responsible for the active functions. So potentially what you can get is a view like this. Here we have a network view of different genes that are involved in cell wall biogenesis. It can be a simple view where the size of the nodes represents the relative abundance of those genes within the data set. But if you can reveal which taxa are responsible, then you can get to these more complex views, where the pie charts show the relative contribution of different taxa to these particular functions. 
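The taxonomic breakdown behind those pie-chart nodes can be computed from per-read annotations: node size is total reads per function, pie slices are each taxon's share. A minimal sketch, with made-up function names, taxon names, and read counts purely for illustration:

```python
from collections import defaultdict

# Hypothetical (function, taxon, read_count) annotations from a
# metatranscriptomic data set -- illustrative numbers only.
reads = [
    ("cell wall biogenesis", "Bacteroides", 60),
    ("cell wall biogenesis", "Lactobacillus", 20),
    ("butyrate production", "Faecalibacterium", 45),
    ("butyrate production", "Bacteroides", 15),
]

totals = defaultdict(int)       # node size: total reads per function
fractions = defaultdict(dict)   # pie slices: taxon share per function

for func, taxon, count in reads:
    totals[func] += count

for func, taxon, count in reads:
    fractions[func][taxon] = count / totals[func]

print(totals["cell wall biogenesis"])                        # 80
print(fractions["butyrate production"]["Faecalibacterium"])  # 0.75
```

The same two dictionaries are all a plotting layer needs to draw the network view described in the lecture.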
So for example, for these four functions here, you can see that one particular taxon is largely responsible for those functions. So you can start highlighting not just which functions and pathways are being actively expressed within your particular microbiome, but also what the keystone taxa are: which taxa are critical to ensure that you have a properly functioning microbiome. So at this stage, does anyone have any questions on this first broad introduction? I would encourage people to raise your hands as I'm going along if there's anything that you want clarified. Okay, so how do we apply RNA-seq? Who here is familiar with RNA-seq? Okay, so there's a few of you who are familiar with RNA-seq. RNA-seq is really just the unbiased sequencing of an RNA sample. So here we might take a sample from a mouse, a mouse gut, extract the RNA, and basically you're just randomly shearing, fragmenting, all of the RNA within that sample, sequencing it, and then mapping it back to the genome of your particular organism to identify which of the genes within that genome are actively being expressed within your sample. It's basically taken over from microarrays; since around about 2008 or so, RNA-seq has really dominated this area of gene expression. So rather than microarrays, we're now relying on RNA-seq to get an idea of expression levels of genes within the sample. Whereas microarrays give you more of an analog kind of output, RNA-seq gives a digital readout of the expression of transcripts. Yes, we're actually sequencing RNA. RNA-seq is typically applied to organisms with a reference, meaning they've been sequenced already. 
So somebody doing an analysis of C. elegans who wants to see which genes are expressed in this worm under different conditions, for example, will extract the RNA, perform the RNA-seq experiment, and then identify which of the genes are actually being expressed in that sample, their relative abundance, and potentially isoforms as well. However, for microbiomes, we have a couple of challenges associated with this. So what is the general idea here? We have our mouse, we take a gut sample, we extract the RNA, and this RNA is going to have different relative abundances of transcripts from different species. So we might have four different species represented in this sample. You're just randomly fragmenting and sequencing each of these transcripts, and then you align all of the reads generated from the sequencer to known transcripts. And that will give you a relative expression. So if you know that six of these reads map back to a particular transcript, then it has a relative expression of six, for example. Now, there are obviously a couple of challenges associated with this approach. In a typical kind of RNA-seq experiment, applied to, say, a eukaryotic organism, you isolate the mRNA, you fragment it, you sequence, and then the reads are mapped to a reference genome. There's standard software for this, such as BWA, which I think you've been using. This gives you support that the transcript is expressed, and it also tells you what the relative abundance of that transcript is. It can also tell you the presence and abundance of different isoforms of that transcript. This is a typical genome browser view. This is a gene here, this is the gene model, and then you map on this track here, which shows you the relative abundance of reads mapping to different regions of this gene. And this tells you that this particular gene is expressed at a certain level within this particular data set. 
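The read-counting idea just described, where six reads mapping back to a transcript give it a relative expression of six, is at heart a simple tally over alignments. A toy sketch; the read and transcript identifiers are made up, and in practice the pairs would come from an aligner such as BWA:

```python
from collections import Counter

# Hypothetical best-hit alignments: (read_id, transcript_id) pairs.
alignments = [
    ("read1", "txA"), ("read2", "txA"), ("read3", "txA"),
    ("read4", "txA"), ("read5", "txA"), ("read6", "txA"),
    ("read7", "txB"), ("read8", "txB"),
]

# Tally reads per transcript -- the raw relative expression level.
expression = Counter(tx for _, tx in alignments)

print(expression["txA"])  # 6
print(expression["txB"])  # 2
```

Real pipelines normalize these raw counts for transcript length and sequencing depth, but the underlying quantity is this per-transcript read tally.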
So this works great when you have a reference genome, as for model eukaryotes. However, microbiome samples are mixed communities, and we don't have reference genomes for them. So we face a couple of challenges. First of all, unlike eukaryotic mRNAs, bacterial mRNAs lack a poly-A tail. This makes it very difficult to separate the mRNA from the ribosomal RNA, and this is a huge problem because ribosomal RNA tends to be in huge abundance relative to the messenger RNA. The second challenge we face is that for environmental samples we don't have the reference genomes, so we can't do this kind of mapping as easily or as effectively as we can when we know the actual genome. Okay, so this is the pipeline that we have for a typical metatranscriptomic analysis. Like the RNA-seq experiment, we take the gut microbiome sample, we extract the RNA, we prepare the libraries, and we put them on the sequencing machine, which generates the reads. Once we have the reads, we have to remove the low-quality reads, we have to remove all the ribosomal RNA reads, and we have to remove the host reads. So the idea is that you start with quite a lot of reads, and then as you go through these steps, you get fewer and fewer reads. Now, I would suggest that the relative proportion of usable reads is increasing as better tools and better kits appear on the market, but there's still the issue that you're generating a whole bunch of reads and ultimately the bacterial mRNAs that you're getting out are a relatively low proportion. Do you know of any ways to enrich your samples for mRNA? Right, so I'll get on to that, but there are kits available which do a reasonable job of actually selecting those out, and I'll show that, I think, in a couple of slides. That's right, no, it's obviously a huge issue, and we need to make the best of it. It seems like a huge waste of sequencing effort. Absolutely. 
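The successive filtering steps described above (low quality, then rRNA, then host) form a funnel that shrinks the read set at each stage. A toy sketch of that funnel; the read records and filter functions are made-up stand-ins for real tools such as Trimmomatic, Infernal, or BWA:

```python
# Toy read records: (sequence, mean_quality, source) -- purely illustrative.
reads = [
    ("ACGT...", 35, "bacterial_mRNA"),
    ("TTAG...", 12, "bacterial_mRNA"),  # low quality, dropped first
    ("GGCA...", 38, "rRNA"),            # ribosomal, dropped second
    ("CCTA...", 36, "host"),            # mouse host read, dropped third
    ("ATGC...", 40, "bacterial_mRNA"),
]

def quality_filter(rs):  # stand-in for a quality trimmer like Trimmomatic
    return [r for r in rs if r[1] >= 20]

def rrna_filter(rs):     # stand-in for an rRNA screen like Infernal
    return [r for r in rs if r[2] != "rRNA"]

def host_filter(rs):     # stand-in for host-genome screening with BWA
    return [r for r in rs if r[2] != "host"]

surviving = reads
for step in (quality_filter, rrna_filter, host_filter):
    surviving = step(surviving)
    print(step.__name__, len(surviving))  # read count shrinks at each step

# After all filters, only the putative bacterial mRNA reads remain.
```

Real filters classify reads by alignment rather than by a label, but the funnel structure, and the shrinking counts it prints, is exactly what the pipeline slide shows.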
So once we have our, say, 5% of putative bacterial mRNA reads, we then do an assembly, and I'll explain why we need to assemble. Then, from the contigs that we get from the assembly, as well as the singletons that don't actually map onto contigs, we try to map these to known bacterial transcripts. Again, this is a bit of a mission, and I'll explain what we can do about that. Then finally we can map to pathways and produce these kinds of nice pictures that our bosses like, because it looks nice and they can publish something. And finally you can do some kind of sample comparison, to show that this pathway in this sample is very different in terms of expression from, for example, another sample. All right, so I mentioned sample collection and RNA extraction. One thing we find when we're working with different groups who are generating these data sets is a lack of appreciation, particularly from MDs, as to what kinds of samples we can get and how we should store them to make sure that we have high-quality RNA. We know that RNA quality deteriorates rapidly within a sample, so the best thing that we can do is to process these samples immediately, extract the RNA, and then store the RNA at minus 80. Actually, the very best thing that we can do is to process immediately, extract the RNA, make the libraries, and do the sequencing; but absent that, if you can extract the RNA and then store the RNA at minus 80, that's great. The next best is to snap freeze in liquid nitrogen and then store at minus 80. Yes? I was just wondering, with respect to that, have you compared it to just storing the precipitate? No, we have not. We did an experiment recently where we tried 16 different conditions of storage of the raw material and processing, plus and minus RNAlater. But we haven't looked at just storing the precipitate, no. 
John, I just have a question. So the RNA-seq work I'm most familiar with is on pure cultures exposed to different environmental conditions. We use RNAlater for that. Right. Is there an issue with that for this, as there is for the metagenomics? Potentially, potentially. So I state here that you should really think about avoiding the use of RNAlater, because it does lyse some cells. Some of your bacterial cells are going to be more prone to the RNAlater than others, and so that's going to enrich for certain species; but it can also interfere with the RNA extraction kits and potentially cause some biases in the sequences that you're actually able to extract. So it might be worthwhile doing a comparison of plus and minus RNAlater to see what the actual effect is. We're working with a gram-positive, so we might re-examine its use. Okay. And it still might be impacting the subsequent sequencing steps. In terms of cost per sample, it's not cheap doing metatranscriptomics. We would suggest generating about 20 million reads per sample, and I'll explain how we get that number later. But given this requirement to generate about 20 million reads, we're in the region of about $300 to $400 per sample. So these are not cheap experiments, and the main part of these costs is actually the library kits for generating the libraries for sequencing. So even if sequencing is going to come down in price, the cost of the kits is still going to remain relatively stable, and so we're probably not going to see this cost coming down to the same kinds of levels that we have for 16S, where we're now at about $25 or $30 per sample. So these are kind of expensive experiments, and as a consequence, people are very concerned about what number of biological replicates they should do. We recommend at least two. Metatranscriptomics is very much in its infancy. 
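The arithmetic behind the 20-million-read figure mentioned above, which the lecture returns to later, works back from the fraction of sequenced reads that survive as usable bacterial mRNA. A sketch, assuming the roughly 25 percent mRNA recovery quoted later for a Ribo-Zero style kit and a 5-million-mRNA-read target; the function names are my own:

```python
import math

def required_depth(target_mrna_reads, mrna_fraction):
    """Total reads to sequence so that the expected number of
    usable bacterial mRNA reads meets the target."""
    return math.ceil(target_mrna_reads / mrna_fraction)

# ~5M mRNA reads capture most of the enzymatic activity (per the
# rarefaction analysis later in the lecture); kits recover ~25% mRNA.
depth = required_depth(5_000_000, 0.25)
print(depth)  # 20000000

def cost_range(n_samples, low=300, high=400):
    """Rough budget at the quoted $300-$400 per 20M-read sample."""
    return n_samples * low, n_samples * high

print(cost_range(10))  # (3000, 4000)
```

Plugging in replicates and time points makes clear why these experiments get expensive quickly: two replicates at four time points is already eight samples, i.e. $2,400 to $3,200.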
There's not that many studies that have been published so far, and of those that have been published, very few have more than two biological replicates; many of them don't even have any replicates whatsoever. And so knowing what number of replicates to do, doing some kind of power analysis to see what kind of power you have in your experimental design, can be extremely challenging. So at the moment I would suggest that we're at a stage with metatranscriptomics where we're really applying it for hypothesis generation. I don't suspect that we have the power at the moment to really identify individual genes whose relative expression differences between samples are statistically supported. Okay, so going back to this issue of ribosomal RNAs and removing ribosomal RNAs: there are several kits available to remove the ribosomal RNAs. Your starting material needs to be around about 500 nanograms to 2.5 micrograms of RNA; depending on what you're working with, this may or may not be an issue. And this is one kit that we've had pretty good success with: Ribo-Zero. It used to be Ribo-Minus, but it's now Ribo-Zero. So here on the left-hand side of this graph we have an experiment where they didn't do any rRNA depletion, and you get 4.5 percent mRNAs that map back to the genome and around 72 percent ribosomal RNA, so you can see the relative abundance of these two species within the sample. However, when you apply their kit, you get a complete swap; so according to their claims, it does a really good job of enriching for messenger RNAs. These are some samples that we've been processing and feeding through our pipeline. So there are various mouse samples here, cow rumen, kimchi, deep sea and permafrost. And what we've got here in red, these red blocks, are the proportion of ribosomal RNAs that were identified from each of these samples. 
This gray one here is adapter and low-quality sequence; permafrost is a little bit sad here, in that 99.5 percent was low-quality data. We also have host reads that we can identify. These can be informative and can tell you what the host is actually expressing in that environment, so they can actually be quite important. But the ones that we're really interested in are the bacterial mRNAs, and those are these light green blocks. So for these kits, we've used the Ribo-Minus kit here, I think the mirVana kit, and there's the MICROBExpress kit. These were giving us around about 20 percent mRNA reads from our samples. With the Ribo-Zero, I think we're now up to between about 30 and 40 percent, depending on the sample. Sorry. Yeah. I just have a question: I've used the Ribo-Zero kit in the past, and I was wondering if you had any experience running into this. I was working with an organism whose rRNA the Ribo-Zero didn't actually pull out. Did you get an idea of the relative proportion of ribosomal RNAs that came out of that particular sample? There was quite a lot. Yeah. So did you tell the manufacturer of the kit about that? So yeah, I think these kinds of caveats are going to crop up. The Ribo-Minus came out in about 2009, the Ribo-Zero came out in about 2012, so there's going to be a continuous evolution of these kinds of kits. I think the latest one is the Ribo-Zero Gold epidemiology kit, which is what we're using for these. But again, as you say, we're going to come across instances where these kits just don't work particularly effectively. Rob, did you? Oh, okay. Okay, so this might be better on your slide, but these are 16 samples that we took. 
We looked at four different variables: temperature, storage time, two different kits, and RNAlater. Each of these pairs represents, in this case, temperature of four degrees versus 20 degrees, fresh versus one week, the Ribo-Zero versus the MICROBExpress kit, and plus or minus RNAlater. On the left-hand side of the first of these pairs are samples that were stored at four degrees versus prepared at 20 degrees, and unless you're storing for about a week, storage at four degrees is so much better than 20. This may have something to do with thawing the sample out. Fresh is better than storing for one week; the first of these pairs represents fresh, so you get a higher number of mRNA reads when you're processing fresh as opposed to holding it for a week. The Ribo-Zero kit is better than MICROBExpress; this is the right-hand side of these pairs in each case, and in this particular case you get quite a significant amount of variability with MICROBExpress. And then RNAlater seems to work, again this is the right-hand one of each of these pairs, but because of sample biases and sequencing biases, we don't really recommend the use of RNAlater, particularly in microbiome analysis. Okay, so that's as much as I want to say about sample preparation. Any further questions on that? In certain time-series experiments it's really impossible to have fresh samples; what do you recommend in that sort of scenario? 
Do you think you'd introduce a bias just based on that? Say you have a time-series experiment running over a week, because for instance you're monitoring a microbiome, and you sample at day zero, then two, then four, then seven. So here we're storing at minus 20; I think the suggestion is to store at minus 80, and then to leave it until you're ready, because you've got the considerations of cost and you want to prepare all the samples at once to bring down the cost. So store at minus 80, and then when you're ready, bring them all out at once and do the preparation all at once. Even for the last one, the day-seven sample, you're still storing it; so putting them all down to minus 80 so they all see at least one week of storage would be one way of not introducing that bias. This is an issue that we're facing with samples collected from kids suffering from malnourishment in Malawi, and it's obviously an issue when you're collecting samples in countries that don't necessarily have the same facilities that we do in Canada and the US. Fortunately this is a Wellcome Trust campus, so they have access to minus 80 freezers; but if we were to think about getting stool samples that are stored by people in their own homes, then there probably wouldn't be any point doing those kinds of experiments. So you really do have to make sure that you can get these samples as quickly as possible into appropriate storage conditions. Excuse me. Yeah. Would you comment more on the effect of RNAlater? Is that within the same source, or between, for example, an environmental sample versus a human sample? Is the effect of RNAlater within the sample, or...? 
So I'm not sure that there have been any comprehensive studies looking at the differences between different types of environmental samples. This is from stool samples; this is information from a colleague, Dan Frank, who's in Colorado, and his experience of RNAlater is that when he's applying it to his stool samples, he finds that it's interfering with the processing of those samples. Okay. It's widely recommended for samples collected in the field. Well, yeah, but even if its use is recommended: I think there are sufficient biases already inherent in doing these kinds of experiments that if we can minimize them and reduce these kinds of artificial constraints on the sequences that we're generating, then so much the better. So which sample was this? So there was a sample that we applied one kit to, and we just didn't recover any Parabacteroides 16S sequences; and I can't remember which kit it was, it might have been the mirVana kit or it might have been the Ribo-Minus kit. But it resulted in the selective depletion of one specific material: even though it was an mRNA enrichment kit, it effectively removed all the Parabacteroides 16S, while some of the other 16S sequences were still able to creep through. So if you looked at the 16S distribution from that sample, you saw a selective bias against that Parabacteroides. It does make you wonder what other biases these kits are adding to your sample preparation. All right, so we have our sample; hopefully it's a fresh sample, a nice sample, with a lot of mRNA in there. How many reads do we need to generate? This is a question that has kind of arisen and that we've been interested in for a few years now, because we've been processing all these different data sets: a kimchi data set, a deep-sea data set, a mouse data set, a cow data set. Three of these were obtained from other labs, who deposited them in 
the Sequence Read Archive at NCBI; the mouse data set we generated ourselves. When we processed them, we were looking at the number of enzyme classification (EC) numbers that we could get within the samples. What we're showing here is a kind of rarefaction plot where, as we add in an increasing number of reads from these different samples, we're seeing how many new EC numbers, which we're using as a proxy for enzymatic reactions, we uncover at certain numbers of reads: 2 million, 4 million, 6 million and 8 million. You can see that as you generate more and more reads, you get more and more EC numbers, but we would suggest that round about here, which is about 5 million, gives you a reasonable approximation of the actual functional capacity, at least in terms of the enzymes encoded within the sample. So round about 5 million mRNA reads captures round about 90 to 95 percent of the enzymatic activity within the microbiome. And with kits now conservatively able to get up to about 25 percent mRNA, this is the Ribo-Zero kit, this suggests that round about 20 million reads per sample is the minimum that you should be aiming for. So from a cost-benefit perspective, that's how we come up with this number of 20 million. Now, a lot of metatranscriptomic data sets published up to about 2012 or so relied on 454 sequencing. This can provide long reads, which can be very useful for annotation, but it just doesn't give you the depth of coverage that you need. If you think you need to generate 20 million reads per sample, and you think about replicates, and you think about time series, you can see how the price of your experiment is going up. With 454 it's really out of the question, because you can only generate around 500,000 reads per run, whereas Illumina generates around about 40 to 50 million reads. So if you really want the capacity, we'd suggest 
the HiSeq or NextSeq; these are improving as well, able to capture longer and longer reads, and these are the machines you should think about to provide the sequencing depth that you really need. So, has multiplexing already been covered in a previous lecture? One nod here. Oh, okay. So how many of you have met this concept of multiplexing? Does anybody want me to explain multiplexing? Good. So just briefly, for those people who aren't familiar with multiplexing: this is where you add a barcode onto each of the sequences that you're generating, and this enables you to combine many different samples within a single sequencing run, which brings down the cost; then there are bioinformatics pipelines that enable you to deconvolute each of those reads into their respective bins for the samples they derived from. Yeah? And do you know if this works, too, for soil? I don't know how soil compares to deep sea, but it's really diverse. So we had permafrost in there, which I think might be the lowest one in the middle, because it was a bit sad and hadn't really worked out. So we don't know, because we haven't analyzed a soil metatranscriptome; we can only go on the deep sea, which was a very rich and very diverse environment. So we would suspect that the deep sea is potentially going to be... we don't know, it depends. Maybe, Rob, do you know how diverse deep sea is compared with soil? An order of magnitude. All right, so again we're limited in our estimation by the samples that we're actually processing. Yes, exactly, yes. There's also potentially a compounding factor here, that what you're picking up isn't necessarily a real enzyme activity; it's just noise in the system, a random match to a particular enzyme in your database. So that's another consideration that you might want to bear in mind. So when we look at 
these kinds of very low-abundance EC numbers and what they're involved in, it's not clear that they're playing any meaningful role within, as you suggest, these kinds of core pathways. A lot of them seem to be involved in secondary metabolites, things that are add-ons and not necessarily connected to the entire metabolic network. So it's not clear how meaningful these low-abundance enzyme reactions that we're picking up within these data sets really are. Again, that's another consideration; but if you're not constrained by cost, then you can go 20, 50, 100 million reads. If you want to get the best bang for your buck, then we would suggest round about 20 million. Okay, so we've generated... yes? So, a question from a practical perspective: when you do RNA-seq on a single organism, you know where each transcript comes from, but with a metatranscriptome you're sequencing a pool from the environment, and homologous genes can come from many different taxa. How do you know, down the line, which organism a particular transcript comes from, and how do you sort that out? So the question, and correct me if I misinterpret this, is: once you've generated the sequence, how do you know that what you're getting out is coming from a particular taxon? Yes, is that correct? Yes. Okay, so we'll attempt to get to that later on. In some 
cases you might not care, and you might just be interested in identifying the functions associated with the microbiome: you could predict, for example, the metabolic capacity of that microbiome, and that metabolic capacity might be, for example, changing the metabolism of the host, which could then inform on, for example, a certain type of disease. But in other cases you might want to know what the taxa are, and there are a number of programs out there which do a reasonable job of assigning reads to taxa; I'll go through that towards the end of the talk. Okay, so we've generated our reads; we now want to go through the entire pipeline of processing, filtering, and annotating, from a functional as well as a taxonomic perspective. These pipelines are very much in their infancy, still very much being developed; new tools are being benchmarked to see which ones perform better than others at analyzing these data sets without introducing biases. So I'm going to introduce the current pipeline that we use in our lab, but again, this is evolving. The first thing that we need to do is remove low-quality sequences. Was Trimmomatic discussed in a previous lecture at all? Okay. So Trimmomatic is from a research group in France, and it's kind of taken over, I think, on a lot of next-generation sequencing platforms for trimming of low-quality reads. It's incredibly quick, and when you do your tutorial you'll see just how quick it actually is. Basically it uses a sliding-window approach: it starts at the five prime end and looks for low-quality regions; most low quality occurs at the three prime end, and so as it scans across with this sliding window, once it identifies a chunk of low quality, it says anything beyond that is going to be low quality, and so it removes it. Once it's removed, it sees what the size of 
the remaining read; if it's below 36 base pairs (and this is tunable), the read is just discarded. You can write your own scripts to do this kind of thing, but Trimmomatic is incredibly fast, so I really recommend using it. It also has the capability of trimming adapters, so you can screen against the different sequencing adapters. When you're doing the library preparation, you're adding artificial bits of sequence to either side of your fragments, and sometimes these are actually captured in the reads that you generate, so they need to be removed. Depending on how you make the library, you can also get plasmid vector sequence in your material, and it's important to remove that too. Trimmomatic has that capability; we've been using cross_match, and the tutorial will use cross_match. We've been using cross_match for 15 years, so it's going to be hard for us to swap to Trimmomatic, but Trimmomatic looks as though it's the new standard, and it seems to beat a lot of other programs in terms of being incredibly quick at processing hundreds of millions of reads. Then, along with all the other rubbish in there, we also have an issue with host material in certain microbiomes. If you're sequencing gut material from a mouse, you're going to have some mouse reads in there, so you want to identify and remove those kinds of host sequences; for that, tools like BWA and BLAT offer useful alternatives. And again, we have a lot of ribosomal RNA, and this needs to be filtered out; for that we can use tools such as BLAT and Infernal. Infernal is the one we use, because it's incredibly sensitive and is able to pick up a lot of ribosomal RNA sequences that simple BLAST or BLAT searches just don't pick up.
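To make the sliding-window trimming described above concrete, here is a minimal Python sketch. This is an illustration of the idea, not Trimmomatic's actual implementation; the window size, quality threshold, and minimum length shown are just common defaults.

```python
def sliding_window_trim(seq, quals, window=4, min_qual=15, min_len=36):
    """Trimmomatic-style trimming sketch: scan 5'->3' with a sliding
    window; once the window's mean quality drops below a threshold,
    cut the read there. Reads shorter than min_len are discarded.

    Returns the trimmed sequence, or None if it ends up too short.
    """
    cut = len(seq)
    for i in range(len(seq) - window + 1):
        mean_q = sum(quals[i:i + window]) / window
        if mean_q < min_qual:   # low-quality window found
            cut = i             # trim everything from here onwards
            break
    trimmed = seq[:cut]
    return trimmed if len(trimmed) >= min_len else None
```

For example, a 50 bp read whose last ten bases are low quality is trimmed back to the point where the window mean first falls below the threshold, while a uniformly bad read is discarded outright.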
The problem with Infernal is that it's based on a hidden Markov model profile search, and so it's incredibly slow. When we switched over to Infernal, we had an allocation on a supercomputing cluster in Toronto, and we seem to be churning through that allocation very quickly; it requires an awful lot of processing power, unfortunately. What about SortMeRNA? SortMeRNA... no, that sounds like something we should explore; it sounds similar in aims. One thing we've tried with Infernal is to break down the set of ribosomal RNA families that we're trying to capture, because the Infernal database has a lot of these non-coding snRNAs and so forth, and they really slow things down. But even if we just focus on certain families, it still seems to be relatively slow, so we should have a look at your suggestion. Okay, so we've filtered out, hopefully, the stuff that we don't want, and we're left with the stuff we do want: these putative messenger RNAs from our bacteria. The next step that we suggest is assembly: you take the putative mRNA reads and perform an assembly step. The idea is that we want to assemble because it improves annotation accuracy. Here we have contig lengths, from around 60, up to 80, up to 100 base pairs, and as we get over 100, the proportion of reads that we can actually annotate gets very close to around 80 or 90 percent. Once you get over 150 base pairs or so, you're in a very good region for being able to identify a sequence match to your read. So if we can assemble relatively short reads into these longer contigs, it really does enhance our capacity to annotate these reads. In a comparison of three assembly tools, including Trinity, we find that Trinity performs best: in terms of the percentage of reads that could be mapped to a known gene using BLAST, it was around about 50 percent of the reads after assembly with Trinity. I think this was a mouse gut data
set; around 50 percent could be annotated after this assembly step, so it really gives us a better capability to annotate these reads. Now, yesterday there was a comment that chimeras can be a bit of an issue, particularly if you consider that a lot of these mRNAs are going to be homologs from closely related species, and they could all get merged together. We've done some testing on this using simulated data sets, so that we can actually identify what the truth is, looking at a couple of different data sets of differing diversity: one with ten taxa, one with maybe 100 or 200 taxa. We find that chimeras don't seem to be that much of a problem: around about three or four percent, I think, of the contigs that we get out look as though they might be chimeric. And again, if you're not too worried about knowing the specific contributions of the taxa within your sample, and you're more interested in the functions, this probably isn't a huge issue anyway; it's only when you're interested in knowing which specific taxa are in your sample that this becomes a problem. What was interesting with Trinity, when we applied it to the deep-sea data set, is that the most abundant contig that we got out was very long, around about 40,000 base pairs or so. This was fantastic: Trinity had done an incredible job of assembling this amazing sequence that we'd found in the deep sea. So we looked at it, and we were getting excited; it was some kind of phage, and phages are incredibly abundant within the deep sea. And then it turned out it was PhiX. For those of you who are familiar with sequencing, PhiX is the spike-in that is used for quality control during the sequencing run. But nonetheless, I think it showed the power of applying the Trinity assembler: despite all the other, I shouldn't say rubbish, but all
the other reads, all the diversity that was in the messenger RNAs within this sample, this assembler was still able to regenerate and reassemble the entire genome of a phage. I think it really demonstrates the fidelity that you can get with these kinds of assembly programs. All right, so on to the next challenge: functional annotation. If we were doing a new genome, you might hope that you'd have a whole bunch of collaborators who could go into the lab and explore the individual functions of the genes you identify in that genome, do the experiments and so forth. But as more and more genome projects are done, people aren't interested in doing that kind of functional characterization, so really, I think, the only way that functional annotation is performed these days on any scale, a hundred genes or more, is through these automated sequence-similarity search tools. So we're reliant on tools such as BWA, BLAT, BLAST, and all of the variants that are coming out which improve on their speed. BWA and BLAT are extremely quick tools that can be very effective if you have a reference genome. We don't have a reference genome, and the issue we face here is that the sequence diversity in these environments is actually huge. This is a pan-genome rarefaction plot for Streptococcus agalactiae; this was around 2008, when 14 different strains had been sequenced, and every time you sequence a new strain, you end up with a whole bunch of new genes associated with that strain. I think this really points to the amount of diversity that is actually out there. So BWA and BLAT, which really rely on being able to identify near-perfect matches, just aren't going to be very suitable for these kinds of environmental samples. On the other hand, a lot of this diversity is really occurring at the nucleotide level, so if we can work at the protein, the peptide, level, then we can be a lot more successful. So our solution has been to work in peptide space.
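To illustrate what working in peptide space means, here is a small sketch of six-frame translation, which is the operation that BLASTX-style tools perform on each nucleotide read before searching a protein database. The codon table below is the standard genetic code; the function names are my own.

```python
BASES = "TCAG"
# Standard genetic code, amino acids in TCAG codon order ('*' = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def six_frame_translate(read):
    """Translate a nucleotide read in all six frames
    (three forward, three reverse-complement)."""
    frames = []
    for strand in (read, revcomp(read)):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames.append("".join(CODON_TABLE[c] for c in codons))
    return frames
```

Because synonymous substitutions between related strains change the nucleotide sequence but not the peptide, searching these translated frames tolerates far more of the strain-level diversity described above.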
We use BLASTX. There are faster flavors of BLASTX: USEARCH, other accelerated BLASTX-like tools, and DIAMOND, which we heard about yesterday. We have to do some quality-control checks on these. USEARCH unfortunately has some issues over cost, and some of the fast alternatives seem to have issues with the quality of the matches they produce; so while they're fast, the quality of what you get out isn't as high as what you get from BLASTX. At the moment we're relying on BLASTX, but the speed is obviously a huge problem, so if we can switch over to something like DIAMOND, that gives us greater capability. These are our five data sets, including mouse gut, kimchi, deep sea and permafrost, and we're looking at the proportion of reads that we can annotate with BWA versus BLASTX. Even with BLASTX, which is these orange bars, you can see these light-blue segments, which are the reads we just can't map. You end up with only about 30 percent of your putative messenger RNAs that you can actually annotate, which is again a little bit sad when you think about how much sequence you started off with: you're filtering and filtering and filtering, and you end up with less than 30 percent of your putative mRNA reads actually annotated to something. Kimchi is a bit of an exception here; for the kimchi data set they actually sequenced the eight major taxa that form the kimchi community, so they're kind of cheating, because they have reference genomes for the kimchi sample. So one approach you might want to consider, in addition to doing metatranscriptomics, is doing a metagenomics run as well: you do a whole-genome shotgun of your microbiome and then use that for your reference mapping. Any questions on this aspect of annotation? I just want to mention that when we look at the matches we get through BLAST, so here's a contig and its top match, which
looks pretty good, but you look at the expect value and it's 39. That's not E to the minus 39; it's an expectation value of 39, so the statistics behind this match are obviously a little bit flawed. Another example, from the mouse data set for a different bacterial species, is about eight. So when we look at all of our BLAST matches, we find that a large proportion fall in this region here, and we use this to define our cutoff. We don't use the E-value as a cutoff; we require 70 percent sequence identity over 85 percent of the sequence length. That's the cutoff we use for defining an acceptable sequence match in our pipeline. Okay, so we've got our reads, we've processed them, we've done some kind of annotation, and we've mapped them to some kind of known gene; we now want an expression value associated with that gene. So, borrowing from RNA-seq, we have this term RPKM: reads per kilobase of transcript per million mapped reads. The issue is that raw counts are biased by gene length. Here we have a relatively long gene with eight reads mapping to it, and here a much shorter gene that also has eight reads mapping to it. Because a gene is short, you're less likely to find a read associated with that short transcript, so you apply this correction: RPKM normalizes away the differences in the relative lengths of the transcripts. When you look at the RPKM, you see that the short gene is really the most highly expressed gene relative to its size. There are a number of software tools available to do this mapping and to calculate normalized expression, such as Bowtie and Cufflinks, and we actually have a custom script that we use to do this ourselves; it's a relatively trivial calculation, which we're showing here.
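The calculation really is trivial; a sketch of the RPKM formula in Python (the argument names are mine):

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    Normalizes a raw read count for both gene length (in kilobases)
    and sequencing depth (in millions of mapped reads); equivalently,
    count * 1e9 / (length_bp * total_mapped_reads).
    """
    return (read_count * 1e9) / (gene_length_bp * total_mapped_reads)
```

Using the lecture's example: eight reads on a 2,000 bp gene versus eight reads on a 500 bp gene (out of a million mapped reads) give RPKMs of 4 and 16 respectively, so the short gene comes out as the more highly expressed, as described above.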
(I think the notes might say 10 to the nine there as well; that should be 10 to the six.) Okay, so we have the annotation to genes and we have the relative expression; what about the taxonomic information associated with this mapping? How can we think about extracting taxonomic annotation, and should we be using it? One argument as to why we don't necessarily need to worry about the taxonomic contributions comes from a study by the Human Microbiome Project consortium in 2012. Across the bottom here are different samples from different individuals, split into different body sites: stool, buccal mucosa, anterior nares and so forth. In terms of the actual taxonomic composition, you can see a huge amount of diversity across the samples; different individuals have huge variance in the taxonomic groups associated with different parts of the body. But when you look at the actual functions, in this case metabolic pathways and modules, purine synthesis and so forth, you can see that each individual has a pretty similar complement of those functions. So this suggests that the actual community makeup isn't necessarily as informative as the functional composition: despite these huge differences in the taxa present, and the diversity across samples, the functions that all these different communities are carrying out are pretty similar. That's one argument for why we might not want to get so hung up on specific species. On the other hand, assigning taxa to our RNA reads might reveal critical functions that only one taxon is producing, the so-called keystone taxa. Another reason we might want to think about assigning taxonomic labels to our RNA reads is that it could
help in an earlier step, in terms of binning reads for assembly and preventing the formation of those chimeras. Okay, so those are the arguments for and against doing this kind of taxonomic annotation. Any questions? A question: could you screen out ubiquitous functions, ones found in every single microbiome, if you're not interested in them? So, we've been analyzing a lot of different metabolic pathways from a lot of different organisms, and we find that even within one phylum (and we've looked at about 20 different species of protists coming from the same phylum) each organism uses subtly different ways of doing similar things: they have slightly different complements of enzymes to achieve even central metabolic processes. So I think it would be pretty difficult to screen those kinds of common functions out, because they nonetheless form important connections that bridge across the network. That feeds more into the downstream systems analyses, where you're trying to identify groups of enzymes that work together in one sample in a way that's subtly different from the connections you find in a different sample; I'll touch on that briefly, and correct me, or bring it up again, if I haven't. Okay, so there are some tools out there that enable you to do taxonomic annotation. Rob's been working on a pipeline, we've been working on our own pipeline, and there are a couple of others. One approach might be to use something such as BLAST or BWA, but these really lack suitable reference genomes to search against and can get confused, so BLAST and BWA aren't particularly precise in their taxonomic assignments. On the other hand, rather than sequence-similarity methods, you can think about composition-based approaches, using things like nucleotide frequencies, codon biases and so forth. Here we might have a sequence, and we chop it into, in this
case, codon frequencies, a very simple three-base word, and you get some kind of frequency distribution. Then you might apply an approach such as a nearest-neighbour method. This is called a Voronoi partition: it places a cell around the centre of each individual genome, such that everything within that cell is closer to that particular genome than to anything else, with the boundaries representing those divisions. So if something falls within a particular region, it's closer to that genome than to any other. You can take these distributions, this kind of codon space, and identify the nearest neighbour: which genome has the most similar composition in terms of the distribution of all these different codons, among the genomes in your plot. There are a number of different methods; nearest-neighbour is just one, and there are also Bayesian and mixture models and so forth that you can use to take these distributions and predict which genome your particular read is closest to. Perhaps the most successful is the Naive Bayes Classifier, NBC, which does a very good job; it relies on distributions of long k-mers, and a long k-mer is a really good signature for a genome. The point, though, is that it really requires very good reference genomes, or very similar genomes, so that you can pick up those signatures. A newer one is MetaCV, which again uses k-mer frequency profiles, and Rob has an example called RITA, which combines BLAST searches with these codon-bias and nucleotide-distribution metrics. We've been creating one called GIST, which is what I'm going to present to you today: like NBC and MetaCV, it's a pipeline to annotate your reads to specific taxa.
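As a toy illustration of these composition-based approaches (this is not GIST, NBC, or MetaCV themselves), here is a nearest-neighbour classifier over trinucleotide frequency profiles. The reference genome names, sequences, and the choice of k are invented for the example.

```python
from collections import Counter
import math

def kmer_profile(seq, k=3):
    """Normalized k-mer (here trinucleotide) frequency vector."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def distance(p, q):
    """Euclidean distance between two sparse frequency profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))

def nearest_genome(read, reference_profiles, k=3):
    """Assign a read to the reference genome with the most similar
    composition: its 'nearest neighbour' in k-mer space."""
    rp = kmer_profile(read, k)
    return min(reference_profiles,
               key=lambda g: distance(rp, reference_profiles[g]))
```

Real methods use longer k-mers, codon or amino-acid distributions, and proper statistical models, but the principle is the same: the read goes to whichever genome's compositional signature it sits closest to.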
The idea behind GIST is that it integrates several methods; its unique aspect is that it assigns different weights to those methods for each genome. It's tailoring, for each genome, which method is best able to discriminate sequences associated with that particular genome from all the other genomes out there, so it allows you to discriminate, in the best way available, the sequences associated with each of the different genomes. I'm not going to go through the pipeline in detail here. The other thing GIST does is take an expected sequence distribution: if you've done a 16S survey, that gives you the relative distribution of each of the taxa within your sample, and you can use that as a prior, if you like, to help guide the assignment of your reads. So if you see an awful lot of 16S associated with Parabacteroides, GIST will take that into account, and if it has a choice between Parabacteroides and something else, it will prefer Parabacteroides. Just to demonstrate how we assign different weights to different genomes: along the bottom here we have five different genomes, and these bar charts represent the weights associated with each of the different methods. We have eight methods, plus BWA, just in case we're able to match to a known genome. For Clostridium difficile strain 630, we find that a naive Bayes approach relying on amino-acid distributions is a good way of discriminating what's truly C. difficile from the other genomes; on the other hand, for strains such as Streptococcus lactis, the best performer was a Gaussian mixture model over amino-acid distributions. So what this approach is doing is tailoring which of these methods, or which combination of methods with these different weights, is best able to discriminate that genome and separate it out from all the other genomes.
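The 16S prior described above works in a Bayes-like way: the composition-based score for a taxon is weighted by how abundant the 16S survey says that taxon is. A minimal sketch, with invented numbers and function names (GIST's actual weighting is more elaborate):

```python
def classify_with_prior(likelihoods, priors):
    """Pick the taxon maximizing the unnormalized posterior
    P(taxon | read) ~ P(read | taxon) * P(taxon),
    where the prior P(taxon) comes from, e.g., a 16S survey."""
    return max(likelihoods, key=lambda t: likelihoods[t] * priors.get(t, 0.0))
```

So a read whose raw sequence score slightly favours Bacteroides can still be assigned to Parabacteroides if the 16S survey shows Parabacteroides dominating the sample.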
So this is a graph of performance comparing GIST with MetaCV and NBC. MetaCV has this unfortunate capacity to give up and label things as unclassifiable: if it's not able to identify something, it gives up and doesn't come out with an assignment. The community here is from a non-obese diabetic mouse model, grown under germ-free conditions and inoculated with the altered Schaedler flora, so this is supposed to be a community of eight bacterial species. NBC seems to classify a lot more than eight species. This is our program here, and across the three different mice shown, I think we're doing a reasonable job of assigning each of our reads to a specific taxonomic class. But again, NBC works incredibly well if you have a reference genome; because of genetic diversity and its reliance on long k-mers, it can otherwise be problematic. A question: this is really striking, because it's the same sample across three different programs, and the assignments from GIST and MetaCV are extremely different; can you explain why they would be so different? It's a little bit worrying, isn't it? It is, because if you're running one of these tools in your pipeline, expecting its output to be true, and publishing on it, then exactly, it's a bit of an issue. Right, so a big issue for GIST here is Parabacteroides. If you remember, I mentioned that Parabacteroides got depleted in the mRNAs and ribosomal RNAs that we were able to capture from those experiments, so GIST wasn't expecting to see any Parabacteroides. However, when we go back and spike in
the Parabacteroides, then we're able to capture it a bit better. In terms of Staphylococcus, for one: this sample has a large Staphylococcus group that MetaCV can't find, but GIST picks it up, so you can identify regions where there are similarities and differences between the tools. Here we have a large group of Parabacteroides; maybe this tool is doing a genuinely useful job on Parabacteroides compared to this one, or maybe it's just lumping reads into Parabacteroides that should in fact be Clostridia. I think MetaCV has done well there, in that it has identified the Clostridia, whereas NBC seems to have split a lot of these into more groups; there should only be eight taxa in here, so it's splitting reads across more taxa than we should expect to see. We know there are Bacteroides in there, and Parabacteroides, and I also know there are some Clostridia; NBC doesn't seem able to pick up the same number of Clostridia that GIST does. We're looking at establishing a ground truth for this, because the genomes of the altered Schaedler flora strains have recently been published, so we'll actually be able to see what has happened: how consistent each of these taxa is here, or here, or here, and how they relate back to the actual known genomes. Are we finding, for example, that the Enterococcus and Staphylococcus reads all match to just one genome, or are sequences from other taxa also mapping back into that same genome? A question: correct me if I'm wrong, but one of the main issues here is that we have short reads, so it's very hard to know. What about generating a link between the transcriptomic data and the genomic data, which is easier to assemble, so you get longer reads, and
then take the genomic data and map the reads to it, so you get the same sort of answers but with the resolution you need? Yeah, so as I mentioned earlier, you might want to consider running a metagenomic whole-shotgun DNA run in addition to your RNA sequencing run, for exactly that reason. Another question: on the same point, since there's no reference genome, have you tried doing PacBio sequencing, which gives you longer reads, and using that as a reference? That would probably capture the problems we see here with the shorter reads. I'm not sure you'd necessarily need to go to PacBio; you could probably just do a NextSeq whole-genome DNA shotgun analysis. I think that would probably be sufficient, and would again potentially get over some of these problems. So, in terms of what we actually get out of the relative expression of the mRNAs associated with these different taxa: here we're comparing the 16S distributions with the taxon distributions of reads that we've calculated (this is just using BLAST), and we do see some not-so-subtle differences in the abundances. One issue is that there might be biases in the ribosomal RNA sequencing, or biases in the mRNA sequencing; or it might just be that we have a large number of reads here associated with, for example, the Clostridia, relative to what we see in the 16S distribution. The expectation is that maybe the 16S is picking up a lot of taxa that aren't necessarily very active, so you're seeing things that are very abundant within your sample but that aren't actually contributing much to the function of that community. Again, this might encourage you to think about performing metatranscriptomics. Okay, we're getting close to ten o'clock; I would suggest that I wrap up the remainder of this lecture and then we take a
coffee break before starting the tutorial. Is that acceptable to everyone? Okay, good. All right: so we have this fantastic pipeline, it's processed hundreds of millions of reads, you know what genes all these reads come from, and you know what taxa they're coming from. How do we visualize this information? How do we actually get something out of it? Once we've assigned reads to all these categories, we can think about exploring their function, and you've seen quite a few of these Nature and Science papers which produce these kinds of bar charts, in this case over gene ontology categories. Does anyone have any opinions on these kinds of displays, which you keep seeing whether it's metagenomics or metatranscriptomics? Do you find them informative? These tend to be incredibly broad categories: you've got something like "energy production and metabolism", or "transport", and so forth. What does that actually mean? One category might simply have more genes in it; you're not really getting at what's happening at the molecular level. So I think we need to go beyond these COG categories, GO categories, and KEGG pathways for classifying each read, and beyond saying there's a functional difference because some particular broad process is up- or down-regulated, towards something at more of a systems level, more of a molecular level of detail, to try to get more meaningful information out of these data sets. So beyond the broad functional categories, we might start thinking about systems-based analysis. This comes from the perspective that genes don't operate in isolation: they form part of very complex, intricate functional modules. They can form parts of protein complexes, parts of metabolic networks, or even signalling networks. The idea is that if we can place our bacterial transcripts within the context of these
different networks, can we understand something more about the microbiome at a functional level? There are a number of different systems we could think about mapping our information onto, and one thing we've been exploring a lot is E. coli homologs. We have a protein-protein interaction map for E. coli, our model bacterium, which acts, if you like, as a global overview of how each of the proteins within E. coli is organized into functional modules and complexes. So we can take our metatranscriptome data, identify the E. coli homologs of the transcripts we find, and map the expression information from those transcripts onto these networks. That's what we've done on the right-hand side here: we've taken these networks and shaded them by the rate of conservation of the different proteins within the different modules. This is a module of genes involved in biogenesis, these are modules involved in transport, and so forth. Looking at how well conserved some of these proteins are gives you an indication of the expectation of identifying them in your data in the first place, which maybe gets at one of your earlier questions. We can then map on the abundance of mRNA transcripts, and there is no obvious correspondence between relative abundance and conservation. If everything were just random, you would expect, when you map expression data onto these networks, that the patterns would look the same; but they're not, because we find, for example, that proteins that are well conserved aren't necessarily well represented within the data set. So this is giving us an idea that these kinds of
approaches are actually informative. So we take our transcripts, we identify their E. coli homologs, we use the RPKM values associated with those transcripts, and we get the relative abundance of the genes within a particular pathway. We're using the E. coli protein-protein interaction network as a proxy for understanding some of the pathways, complexes, and systems that are present and abundant within the data set. Does that make sense? By using this kind of network approach, we can start thinking about identifying genes or systems that might be upregulated in, for example, a diseased individual versus a healthy individual. This requires some development of statistical platforms; gene-set enrichment analyses, for example, might offer one possibility, and these are currently being developed. We can also use this network approach to identify groups of genes that share common network linkages. For example, these five genes here are part of the tryptophan operon, but you can see that two of them are particularly highly expressed while the other three are only moderately expressed; with some well expressed and some not, the operon as a whole might not come out as statistically significant. However, if we look at these other genes, which are connected to that group but not necessarily to the two over here, we might infer that this represents a functional module: a group of genes that isn't necessarily part of anything defined by KEGG or GO. By using this kind of network approach, you can look at the connections between genes and infer your own relationships, your own groups, that may be differentially regulated, from a more holistic perspective. Yes, so this was a network of protein-protein interactions generated in 2009, built from a combination of data types. One is a TAP-tagging proteomics approach, where you take an
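A sketch of the kind of operation involved here: find connected modules in a toy interaction network, then summarize the mapped expression per module. The gene names, edges, and RPKM values are invented for illustration; real analyses run over the full E. coli interactome with proper statistics.

```python
from collections import deque

def connected_modules(edges):
    """Find connected components ('functional modules') in an
    interaction network given as (a, b) edge pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, modules = set(), []
    for node in adj:
        if node in seen:
            continue
        comp, queue = set(), deque([node])
        while queue:                      # breadth-first traversal
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            queue.extend(adj[n] - comp)
        seen |= comp
        modules.append(comp)
    return modules

def module_expression(modules, rpkm_values):
    """Mean RPKM of the transcripts mapped onto each module."""
    return [sum(rpkm_values.get(g, 0.0) for g in m) / len(m)
            for m in modules]
```

Comparing these per-module summaries between, say, diseased and healthy samples is the sort of contrast the network approach enables.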
E. coli protein carrying a TAP tag and put it through a column, so that everything interacting with that protein also sticks to the column. You elute it, the elution goes into a mass spec, and you identify that protein plus all the other proteins associated with it. That's one data set that feeds into this network. Then there's operon analysis, since genes in the same operon tend to be functionally associated; phylogenetic profiles, where genes found in the same sets of taxa tend to be interacting; and gene-fusion analyses, where two proteins that are separate in E. coli are fused into one polypeptide in another organism, which again is indicative of some kind of functional association. We combine a whole group of these different methods to build this kind of network. There's also the STRING database, produced by, I think, Peer Bork's group at EMBL, which again captures functional associations between bacterial proteins. Ours is focused on E. coli only, and this is one of the limitations of the approach: we are going to miss proteins and systems associated with the other taxa that we might find in our particular microbiome. We don't have a protein interaction network for Bacillus, we don't have one for Streptococcus; we don't have one for anything, really, apart from E.
coli, and so this is really acting as a proxy. If there are other types of networks that you can think of using, that's a viable approach too, and one that we've also used is metabolic pathways: you can convert a metabolic pathway into a network representation and apply similar approaches. So again, we're just using the E. coli protein interaction map because it's the only protein interaction map we have, from a model-organism perspective. Yes? So the question is, for those network diagrams, do the node sizes reflect the total counts? Yes, absolutely; this is the total RPKM across all the transcripts mapping to that node, encompassing all the different taxonomic groups. So in addition to the physical or functional protein interaction networks, we can also use these metabolic pathways, and we've had some discussion of MG-RAST and MEGAN: you can map your expression onto ECs, which in turn map onto these networks, and you can then follow that expression along a particular pathway. But one issue with these kinds of approaches, with MG-RAST and MEGAN, is that they rely on the KEGG definitions of what a pathway should look like, and while KEGG is a very well curated metabolic pathway resource, its main focus has been, again, on E.
coli largely, plus yeast and human, and so a lot of these reactions, and a lot of the connections between these reactions, are inferred from those three model systems. But we know there's a whole bunch of other reactions out there that may let metabolites shunt between different regions of the network. In addition, you might find that part of this network, or part of this pathway, actually belongs to another pathway somewhere else, and so with these relatively simple pathway representations you're missing the connections that span across different pathways. So what we attempt to do is take more of a network-based approach again, where, and it's not so easy to see, but basically each node in this particular network represents a different enzyme classification (EC) number, and the links between the nodes represent shared metabolites. This enables you to trace routes through the network that aren't necessarily captured by the KEGG-defined pathways. So we can start moving away from networks focused on model-organism metabolism, and try to identify new routes that different species within your microbiome might actually be exploiting in order to perform similar biochemical reactions. And then, and this gets into the idea of how we can identify which taxa are associated with these functions, we can take the taxonomic data that we generated before, from MG-RAST or similar tools, and start overlaying these pie charts, if you like, onto these networks. So here we have this enzyme, EC 4.2.7.1, involved in interconverting one metabolite into another, and within this particular microbiome it appears to be largely mediated by Proteobacteria. So you can start using these kinds of visualization tools to interactively explore your
data sets, to see which are the keystone taxa that are perhaps contributing important links within this network. So if you took this taxon away, what happens to the routes this microbiome would be able to use to generate different types of metabolites? If you take it away, does the whole network collapse, because it performed such a vital function that you no longer have a functioning microbiome? So do the Proteobacteria here represent a keystone taxon, providing a unique functional capacity within this network? So again, that was a bit of a messy view, and we've tidied it up; Cytoscape now has a really nice plugin which enables you to do this relatively simply. Here we don't care so much about which taxa are responsible for different functions; the size of each node relates to the expression of each of these different genes or processes within this particular cell wall biogenesis module, but then we can use pie charts, via the plugin, to represent the relative taxonomic contributions associated with each of these different functions. So to my mind, for these four cell wall biogenesis genes here, a lot of the expression is contributed by this one taxon, so maybe they're performing and regulating these kinds of functions as part of the way in which they generate their cell wall. So Cytoscape is just a really lovely tool for taking the kinds of data sets we generate after processing all these sequences and representing these nodes, these proteins if you like, within their systems, as pie charts, or donuts even, which lets you look at the relative contributions of different taxa. All right, only three more slides before coffee. Yes, Rob? So, we do have a couple of these, and I don't have a slide showing this, do I? Nope. So we do have a slide which shows four of these from four different mice, and you can see subtle
variations between the relative contributions of different taxa, but they all look generally pretty similar. So you could see those not even as different time points but as different biological replicates, and they look pretty similar to each other, in a way that they look completely different to the kimchi, to the deep sea, and to the other data sets. And that's all we can say at the moment, because we don't have any other data sets to compare against. So at this kind of level, comparing between these mouse replicates and then against kimchi and so forth, you really are able to see taxonomic contributions to these different functions that make sense from a biologist's perspective. So the kimchi is dominated by Lactobacillus, and we have Leuconostoc, and these are the two main groups driving kimchi fermentation; whereas here we see a lot more Clostridiales and Bacteroidales, which again we expect to see within a mouse gut; and in the deep sea we're seeing a lot more Proteobacteria and delta-Proteobacteria, which we don't see in any of the other data sets. So it does seem that you get these kinds of unique signatures that are representative of the microbiome you're sampling. Again, just to emphasize, we've only done this comparison for four or five different data sets, where we processed everything exactly the same way, but this kind of approach looks very promising. Okay, finally, the last two slides, on statistical considerations. So metatranscriptomics, and I keep trying to emphasize this, is exciting and new, but there's still a lot of progress to be made in developing software and tools for metatranscriptomic analysis. There's no dedicated software or statistical tool for statistical comparisons of metatranscriptomic data sets; we're kind of on our own. We're not even sure how many biological replicates we need, which kind of feeds into
Rob's question: how many replicates do we really need to capture the variance, not just in the samples and in the generation of the mRNAs in the cells, but also in the downstream processing? Practically, at least two, ideally at least three, but again, these are expensive experiments. Power analyses: we come across reviewers who ask for some kind of power analysis, which seems a fairly valid point, but how do we actually go about doing power analyses for these kinds of experiments? It's not really tractable with the current state of the field. Differential expression of individual genes: can we identify those? There are a couple of tools that we can start applying for differential expression of individual genes, though if we're thinking about genes and how we're mapping the reads, we also have to remember that reads may map to a group of paralogs rather than to one identifiable gene. We can also start thinking about gene set enrichment analyses to identify which functional groups show differential expression. So given these problems, and the unresolved issue of power analyses, at the moment metatranscriptomics is really a hypothesis-generating kind of procedure, requiring subsequent targeted validation. So you can run these experiments and identify that this component here, or this complex here, or this pathway here, seems to be generating an awful lot of a particular enzyme which produces a particular metabolite, and maybe that's the metabolite we want to do our targeted metabolomics on. This is a similar kind of analysis to what was done in the metagenomic study of obesity that was published about a year and a half ago, where they identified different taxonomic groups, and pathways associated with those groups, that were upregulated in an obese twin versus a non-obese twin, and then, lo and behold, they did the targeted
metabolomics and were able to validate what their metagenomic experiments initially predicted. Okay, so while there are no dedicated tools for metatranscriptomic analyses, there are these RNA-seq methods: DESeq, edgeR, and ALDEx. ALDEx is a tool developed by a colleague in London, Ontario, Gregory Gloor, and I'd encourage you to read that paper; it's on ALDEx2, which I think just came out last year. There's a very nice explanation there of why some of these other methods, such as DESeq and edgeR, rest on assumptions carried over from microarray-style analyses that don't hold true for RNA-seq, and why ALDEx improves on those assumptions for RNA-seq versus microarray. An alternative is to simply rely on fold change rather than those kinds of statistics. So can you identify these modules, where you see big differences in groups of related functions or genes? Then you can feed those into gene set enrichment analyses. And then there are challenges as to which genes you include: should there be a minimum RPKM cutoff, so that anything below a certain RPKM, with very poor representation, is maybe noise that we don't consider in these kinds of analyses? So what is RPKM? RPKM is reads per kilobase of transcript per million reads sequenced; it's a way of normalizing for the length of the transcript and the depth of sequencing. DESeq uses the read count, right?
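As an aside, before getting to read counts, the RPKM normalization just defined can be made concrete with a minimal sketch; the gene length and read counts here are made up purely for illustration:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads:
    normalizes a raw read count by transcript length (in kb) and
    by sequencing depth (in millions of mapped reads)."""
    return read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)

# Hypothetical example: 500 reads mapped to a 2 kb gene
# in a library of 10 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
print(value)  # 25.0
```

Note that because the total mapped reads appear in the denominator, RPKM values are inherently relative to the rest of the library, which is exactly the compositionality issue discussed next.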
So, rather than RPKM, DESeq, which is an R package, uses raw read counts, and its authors also argue that RPKM is less reliable than read counts. There's also an issue with DESeq, and I don't know if DESeq2 overcomes this: the way you do an RNA-seq experiment, you're limited to, say, a hundred million reads. That sounds like a lot, but it's still a fixed total. With a microarray experiment, when you hybridize onto your array, there are no compounding factors; the abundance measured for one gene doesn't affect the abundance measured for another. With RNA-seq, because the total number of reads is fixed, the relative abundance of one transcript can affect the apparent abundance of all the others, and DESeq and edgeR did not take that into consideration, because they assume the same underlying model as microarrays. Because RNA-seq has this limited number of reads, and those interdependencies, that's why ALDEx is reported to perform better. Well, it's interesting: we ran DESeq, edgeR, and ALDEx against our mouse data sets, because we did actually have three biological replicates, and we found that DESeq gave us maybe three or four differentially expressed genes, edgeR maybe two, and ALDEx maybe three, and they were all different. So again, it just highlights that when you're applying any of these tools to your data set, the tool you choose can really bias what you actually get out, and you need to be very critical about how you look at the results coming out of some of these approaches. And then, finally, just to mention gene set enrichment approaches: there are a number of different methods out there, and the hypergeometric distribution might be one that you can consider using. And with that, we'll break for coffee.
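The hypergeometric enrichment test mentioned above can be sketched in a few lines of Python; all the gene counts here are hypothetical, and real analyses would also need a multiple-testing correction across gene sets:

```python
from math import comb

def hypergeom_enrichment_p(M, n, N, k):
    """One-sided enrichment p-value P(X >= k) under a hypergeometric model:
    M annotated genes in total, n of them in the gene set of interest,
    N genes in the differentially expressed list, k of those in the set."""
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / total

# Hypothetical numbers: 5000 annotated genes, 200 in the pathway,
# 100 differentially expressed genes, 12 of which hit the pathway.
# Expected overlap by chance is only 100 * 200 / 5000 = 4.
p = hypergeom_enrichment_p(M=5000, n=200, N=100, k=12)
print(f"enrichment p-value: {p:.3g}")
```

Requiring at least `k` overlaps (the upper tail) asks whether the pathway is over-represented in the differentially expressed list relative to random sampling from the annotated genome.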