So this is metatranscriptomics. These are the learning objectives. You should have a sense after this lecture, and the ongoing tutorial, of the opportunities and some of the challenges associated with metatranscriptomic analysis. Maybe understand something about the capabilities of what metatranscriptomics can actually do. Again, an appreciation of sample collection, and we'll briefly discuss issues of experimental design. And also, perhaps importantly from this bioinformatics workshop perspective, learn something about how we process what are actually quite complex data sets. And so the tutorial will take you through a small data set extracted from a cow rumen that was generated, I think, around 2012. It's 100,000 reads, to keep it relatively small, but you'll be processing those and seeing what happens at each of the various steps of processing and what you actually find in these data sets. All right, so the overview of the lecture. I'll go over what metatranscriptomics is and how it relates to RNA-seq, and how it doesn't: how is it different from traditional RNA-seq as we understand it? I'll go over issues of experimental design, sample collection and preparation, and so forth. And then I'll take you through the various steps of processing these reads: what is it that we need to take into account when we're actually taking our data set, processing it, and trying to get some information out of it? I'll briefly mention something about statistical analysis. Metatranscriptomics is a relatively new field, and the bar for publishing is really still quite low, so in terms of the statistics, there just aren't the tools and methods out there yet; it's a wide open field for developing new kinds of analysis. And then visualization.
So how do we visualize the output? I'll be pushing this concept of using systems-based networks as a scaffold onto which you can layer your metatranscriptomic data, to help interpret something about what's actually happening in your sample. All right, so I think Morgan went over these differences yesterday. 16S rRNA surveys are relatively simplistic: you just want to get an idea of who is there, so you're using 16S as a marker just to identify who is in your actual sample. It's been really widely applied; it's the technique that's received the most attention over the last eight years or so. But the problem with these 16S surveys is that they really yield only limited functional, mechanistic insights. So, okay, here's a study of GI tracts: as you go down the GI tract you find that the different taxonomic groups alter in proportion, and you see that the IBD gut is very, very different from the healthy gut. But the problem is that you don't know if this is a cause or a consequence of IBD. Is it just that the IBD gut is so messed up that it's now home to all these other bacteria? So the 16S rRNA surveys are really just giving you relatively limited insights into what's actually happening. Metagenomics, on the other hand, gives you a picture of what your bugs can actually do, okay? So this was a study from the Human Microbiome Project consortium in 2012. In the top panel here, you've got all the different taxa across 120-odd individuals, across these different body sites: stool, tongue, and so forth. You can see in the top panel that each of these individuals really has quite a different composition in terms of what bugs are actually there. But the bottom panel is a metagenomic analysis, where they interpreted the data in terms of different functional categories.
In this case it was metabolic pathways, across those same individuals and the eight or so different body sites. And you can see that the functions appear to be relatively stable. So the idea is that maybe it doesn't matter which bugs you have, because they're still capable of giving you the same functionality. And then the next step in this kind of process is metatranscriptomics. This goes a little bit beyond metagenomics: metagenomics is really giving you an idea of the functional potential, but it's not saying what's actually expressed. That's where metatranscriptomics comes in; here we can see who is doing what in the sample. So it's looking at microbiome activity, what is actually active within your sample. And the way it does this is by using RNA-seq. We've probably all heard a lot about RNA-seq, but we're using it in this context to determine what genes, what pathways, are actually being actively expressed within this community. And so, as I mentioned, we can have these kinds of systems-level data sets; this is a network map from yeast, and these are the kinds of protein interaction networks that might represent things like protein complexes or biochemical pathways. And you can layer metatranscriptomic data sets onto them to see which of these nodes, which are the proteins involved in these pathways, are expressed: the relative expression scales the size of the node, so the larger the node, the more highly expressed that particular protein is. And then you can use pie charts on the different proteins to indicate which taxonomic group is responsible for each of those different functions. So again, the idea is that metatranscriptomics reveals the active functions through the size of the nodes, and then through these pie charts you can reveal which taxa are really responsible for those active functions within these networks.
And the lines, what are they? The lines, in this case, are functional interactions: these genes are known to functionally interact with each other, so they're involved in a similar pathway. So you're looking for groups of genes which are highly interlinked, suggesting they might be involved in a biochemical process such as the TCA cycle, or they might be involved in a protein complex, involved in a transport activity, and so forth. We'll discuss more of that towards the end. These are results from a study that we've been working on this year, again just to emphasise the importance of looking at transcriptomic data, this activity data. This is with a colleague in Colorado, who took a mutant mouse line for perilipin-2 (plin-2). This is a gene that's involved, I believe, in fat transport in mice, and he fed these mice high-fat diets alongside wild-type mice, and this is the phylum-level distribution of which bacteria are actually present in the guts of these two types of mice. On the high-fat diet, in the plin-2 and wild-type mice, you can see that the types of bacteria that you find, at the phylum level, are very similar. Okay, so you've got a similar distribution of bacteria within the guts of these mice. However, because we've done the transcriptomics, when we actually map the reads onto functions, we find that certain metabolic pathways encoded by the bacteria are actually differentially expressed, and I think there were about 14 different metabolic pathways which exhibited this differential expression. So to give you a couple of examples, here we have two enzymes in the branched-chain amino acid pathway. Again, these nodes represent the relative contributions of the taxonomic groups that are expressing these different enzyme activities. You can see that this enzyme's activity has increased in the plin-2 mutant, while this enzyme has gone down in the plin-2 mutant.
So the idea is that the genotype of the mouse is actually affecting which genes are being expressed by the microbiome. And this isn't something that we would necessarily pick up through 16S or through metagenomics. So I think the story that we're settling on here is that because this plin-2 mouse is deficient in fat transport, under a high-fat diet you end up with an accumulation of fats in the lumen of the intestine. As a consequence, the energy pathways are undergoing altered regulation to account for this accumulation of fatty acid substrates. So again, this is really just to emphasize why we believe that metatranscriptomics is an important aspect of studying the microbiome. Any questions on the whys and wherefores so far? Oh, good. All right, so how do we go about doing metatranscriptomics? It's based on RNA-seq. Who hasn't heard of RNA-seq? Okay, great. So this is a diagram of the pipeline for an RNA-seq analysis. At the top we might have a cell, from which we extract the RNA; the RNA is really all the transcripts that are being expressed at that point, obviously reflecting the different functions active at that moment. You fragment each of these transcripts and sequence them, and once you've got the sequences, you align all of the reads to known transcripts to measure expression. You end up with a kind of digital readout of gene expression, just from the number of reads mapping to each of your known transcripts. So typically, RNA-seq has been applied to organisms with a reference genome: you're generating these little fragments, these reads, and then mapping them to a reference genome that you already have.
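The "digital readout" idea above is simple enough to sketch in a few lines: once a mapper has assigned each read to a transcript, expression is just a tally of reads per transcript. This is a toy illustration, not any particular tool's output format; the `(read_id, transcript_id)` pairs are a hypothetical stand-in for parsed alignment records.

```python
from collections import Counter

def expression_counts(alignments):
    """Tally mapped reads per transcript: a 'digital readout' of expression.

    `alignments` is a hypothetical list of (read_id, transcript_id) pairs,
    i.e. what you might extract from a read mapper's output after aligning
    reads to known transcripts.
    """
    return Counter(transcript for _, transcript in alignments)

# Usage: three reads, two mapping to geneA, one to geneB.
counts = expression_counts([("r1", "geneA"), ("r2", "geneA"), ("r3", "geneB")])
```

The resulting counts are what downstream differential-expression comparisons between samples start from.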
So you might do RNA-seq on a mouse: you generate all these reads and then you get a readout of the relative expression of the genes that the mouse is expressing at that point in time. However, applied to the microbiome, we have a number of additional challenges associated with this whole process. Okay, so if we think about a typical RNA-seq experiment, it's largely applied to eukaryotic organisms: mRNA is isolated, and after fragmentation and sequencing, reads are mapped to a reference genome. There's a number of standard software tools available for that, such as Bowtie and BWA. And when you do this mapping, it provides support, first of all, that a transcript is actually expressed; that's why a lot of people do these RNA-seq experiments, they just want to confirm the expression of their transcripts. They also look at the relative abundance of transcripts, so which transcripts tend to be up- or down-regulated between different samples. And you can also detect different isoforms; particularly for eukaryotes, you can have alternative splicing, and RNA-seq is a way of seeing what the differential splicing patterns are, again, between samples. However, for microbiome samples, we have some considerable problems. First of all, we have the lack of a poly-A tail. For eukaryotic RNA-seq, extracting the mRNA is pretty easy, because you just do a poly-A enrichment. In microbiome samples, we don't have this poly-A tail, so it can be difficult to isolate the bacterial messenger RNA. And this results in really massive ribosomal RNA accumulation: most of the RNA that you get within your samples is ribosomal RNA. So that's a huge problem: how do we separate the mRNA from the rRNA? The other problem we face is that for these environmental microbiome samples, depending on what we are trying to analyze, we just don't have the reference genome data sets.
So the actual mapping, trying to identify the origin of each of the reads in our sequence data and which gene it actually belongs to, is again a very challenging proposition. Sorry, why is there so much ribosomal RNA? So rRNA is ribosomal RNA. These are the RNAs that are responsible for forming the ribosome and carrying out protein production. They're very abundant because you need a lot of them in order to generate a lot of proteins. Whereas the mRNA is the actual message that is translated by the ribosomes to form the protein. The ribosome is really just a machine that generates proteins; so if we want to understand what proteins are actually being expressed, then we need to target the messenger RNAs, because they're the transcripts that feed into the production of proteins. What about really low-abundance transcripts, can you detect those? No, so again, it's the same issue as with metagenomics: the really low-abundance ones, if you've only got one or two of those reads in your data set, they tend to just get thrown out; it's hard to decide what they actually are. And does mRNA expression reflect protein levels? Right, so that's a very good point. The correlation between mRNA expression and protein expression isn't necessarily linear. So people are now proposing metaproteomics, which we're not going to discuss today, but this is where you take your sample and basically do shotgun proteomics on it: feeding it into a mass spec machine and then determining what proteins are actually within that particular sample. But that has its own set of issues as well, because you really get swamped with high-abundance proteins, so you don't get very much signal back. A study showed that mRNA poorly predicts protein levels in the human brain; do you agree with that? I don't recall that particular study, but I suspect it might also vary from organism to organism, and upon the conditions that the organism is actually being exposed to as well.
So there are probably a lot of factors that feed into it. And there's very little data concerning bacteria and the relationship between mRNA expression and protein abundance there as well. Someone mentioned ribosome profiling: that's where you lyse your cells, extract the ribosomes, and basically capture all the RNAs that are actually inside the ribosomes; you can run a standard RNA-seq procedure alongside this capture and then compare which of the RNAs are actually present and being translated in the ribosome. Right, but that's not really applicable here. There are also huge issues in terms of post-translational modifications making the proteins active, and also the amount of protein turnover: some proteins tend to be more stable and last longer in the cell than other proteins. All of these things are an additional level of complexity that we're not going to tackle today; we're staying with metatranscriptomics. Sorry, Morgan. All right. More than likely; I don't think we've done the studies to look at that in much depth, but certainly having reference genomes for the sample that you're doing metatranscriptomics on, and having a simpler microbiome, makes it a lot easier to annotate, and I've got a slide later which touches on that, and on ribo-depletion kits. All right. So, what does our metatranscriptomic analysis pipeline look like? We've got our mouse, we do the sequencing, we generate the reads. Once we generate the reads, what do we need to do with them to actually extract the information that is of interest to us? First of all, we need to remove low-quality sequences. Okay, this is an issue that varies from sample to sample. We looked at one sample that somebody had deposited at the NCBI from permafrost, and around 98% of that sample was low-quality sequence. How it ended up in the NCBI, I don't know, but it was a data set of incredibly low quality.
After we've screened out low quality, we need to identify the ribosomal RNA reads that remain. There are different tools for this, and the more sensitive ones for detecting ribosomal RNA reads can be quite slow, unfortunately. If we're dealing with a sample that is associated with a host, we might want to remove those host reads before we do any further analysis, so we need to identify which reads are associated with the host. We then like to do an assembly step. This is important because the longer the fragments of RNA we can get, the better probability we have of being able to annotate them; I'll show a graph later on the relationship between sequence length and our ability to annotate. Once we've done the assembly, we then match the contigs to known genes to assign a function to each piece, and then we might think about mapping onto some kind of networks to help interpret our results, and maybe doing some kind of sample comparisons. So those are, I guess, the key steps in any typical pipeline for metatranscriptomics. Okay, so the first steps: sample collection and RNA extraction. We had a brief discussion yesterday about best practices for collecting and storing our samples, and there really aren't any SOPs out there. People are still trying their own ad hoc methods, and it's not clear what the best methods are for collecting RNA and storing it before we actually start sequencing. And we know that the method of storage and the method of preparation can really dramatically impact things like which taxa can be recovered. So our current best suggestion is that you process your sample immediately to extract the RNA, and then you can store the RNA at minus 80; it tends to be relatively stable. The next best is to snap freeze, in liquid nitrogen for example, and again store at minus 80.
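The key processing steps just listed can be sketched as a chain of filters, which is essentially what the tutorial's wrapper scripts do. This is a minimal, hypothetical sketch: the read records and the flags on them (`mean_quality`, `is_rrna`, `is_host`) are placeholders standing in for the output of real tools, not any particular tool's format.

```python
def drop_low_quality(reads, min_quality=20):
    """Step 1: discard reads whose mean base quality is too low."""
    return [r for r in reads if r["mean_quality"] >= min_quality]

def drop_rrna(reads):
    """Step 2: remove reads flagged as ribosomal RNA by an upstream detector."""
    return [r for r in reads if not r["is_rrna"]]

def drop_host(reads):
    """Step 3: remove reads that map to the host genome."""
    return [r for r in reads if not r["is_host"]]

def run_pipeline(reads, steps):
    """Apply each filtering step in order, as a wrapper script would."""
    for step in steps:
        reads = step(reads)
    return reads

# Usage: only clean, non-rRNA, non-host reads survive for assembly/annotation.
reads = [
    {"id": "r1", "mean_quality": 35, "is_rrna": False, "is_host": False},
    {"id": "r2", "mean_quality": 10, "is_rrna": False, "is_host": False},
    {"id": "r3", "mean_quality": 30, "is_rrna": True,  "is_host": False},
    {"id": "r4", "mean_quality": 32, "is_rrna": False, "is_host": True},
]
kept = run_pipeline(reads, [drop_low_quality, drop_rrna, drop_host])
```

In a real pipeline each step would instead call out to an external tool and reformat its output for the next one, but the ordering logic is the same.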
However, as we know, when we're dealing with clinical samples, and you've got your MD in the operating theater taking a biopsy, for example, having liquid nitrogen around might not be practical, and so there are certain compromises that you have to make in order to get these samples in the first place. So just be aware that the quality of the sample you are able to get hold of might actually impact the results that you get. We would actually suggest avoiding the use of RNAlater. This is a reagent that you can add to your samples to make the RNA more stable; however, there are some results suggesting that it can interfere with RNA extraction kits, and so you might get biases associated with the use of something like RNAlater. Okay, so in terms of collecting these samples, how many samples do we need to collect? How many biological replicates? The issue in metatranscriptomics is that it's relatively expensive: it generally runs around about $300 to $400 per sample (I'll mention why in a little bit; I don't know how much Morgan's charging these days, probably about that). So it's relatively expensive, and experimentalists can be a little shy about generating biological replicates. We would suggest at least two; but when people have been doing RNA-seq experiments, really the minimum that you should be doing is four, and you can just about get away with three. Again, though, for metatranscriptomics the bar for publishing is still relatively low, so you can probably get away with two replicates. So only two? Two samples from the same individual, yeah. Well, this is where power analysis comes in, and there really aren't any good frameworks that have been developed so far for doing power analysis of metatranscriptomics.
So again, until these kinds of issues get resolved and the bar starts getting raised, you can see these kinds of experiments more as hypothesis generation: to get an idea of pathways or genes that might be upregulated in your sample of interest, which you can then follow up with more appropriately designed experiments, with the power you need to actually give you the statistics. A question about sampling: if we're taking biopsies from kids' intestines, from two different sites, should we pool extractions, or try to get the amount from just one extraction? We try to avoid pooling, yeah. I really wouldn't recommend pooling samples if you can avoid it. But again, you can be constrained by what samples you can actually get hold of; it's all about these kinds of compromises, but be aware of how these different compromises may actually impact the interpretation. Well, I would suggest that you do that for all of these kinds of analyses. Morgan, what would you suggest for biological replicates? Another question: since it's really not economically easy, is there another method you could use to complement your results? Sorry, I'm not following. So I would suggest that you see these as a kind of discovery phase, and then you do more cost-effective experiments subsequent to that, using, for example, targeted PCR. Or could you do RNA-seq on duplicates and then average them, or use statistics to figure out the actual variability? There are a number of papers on this for RNA-seq: I'd say that if you can afford, say, 60 million reads for one sample, you're better off splitting that into three replicates of 20 million each, to get better statistical power than you would by doing anything else, right?
So if that's the case, splitting the sequencing into three different sets, you're still getting the same total amount of data. Although, again, it depends: for regular RNA-seq, you can probably get away with a more limited number of reads, because you've got a more limited set of transcripts actually present. Whereas here, if you want to capture the depth of the transcripts being expressed by a lot of different organisms across a lot of different pathways, that adds an additional level of complexity. So for example, in this plin-2 study, we used four replicates, and you could see that there was actually quite a bit of variation within a sample across those four replicates. So getting biological replicates for these experiments is an issue, and you're right, these are expensive experiments. As for stool samples: yeah, so the stool sample just needs to be dumped in the fridge, pardon the choice of words, as quickly as possible, and then you're hoping that they're going to bring it in as soon as they can. But for a lot of those experiments as well, you really want to take stool samples from the same individual, say three times over a week, to account for the impact of diet over that week. That's all you can do in practice. Again, this comes back to the discussion we had yesterday: it's staggering that there are so few studies that have actually tried to look at these kinds of storage conditions. I do have a couple of slides which address that a bit, though I'll warn you they're a little bit unsatisfactory; we'll get to that in a couple more slides. And I'll open things up to the floor every 15 minutes or so.
The other thing worth mentioning is that we've actually been a little reluctant ourselves to do stool samples for metatranscriptomics, because we feel it's not really reflective of what's happening higher up in the intestine. So we're really focusing these metatranscriptomic studies on the more informative sites, i.e. biopsies. All right, so we've got our sample, which is super awesome because we've got this person to come into the lab and donate straight into liquid nitrogen. But we now have the problem that these bacterial mRNAs lack the poly-A tail that eukaryotic mRNAs carry. So how do we remove the abundant ribosomal RNA species, which can account for about 99% of all the RNA within the sample? There are a number of kits available; the one that we've been using is Ribo-Zero, from Illumina. These are their results: here, if you don't deplete with their kit, this is what they're showing, the amount of ribosomal RNA they recover is 72%, and the amount mapping to messenger RNA is 0.5%. However, with their kit, it's fantastic: only 6% is ribosomal RNA, and around 70% of it is messenger RNA. We haven't quite had the same success as these guys; I think we're able to get to about 30%. They do a reasonable job, and I think the Ribo-Zero kit is probably the best one currently available on the market, but presumably over the next few years we'd expect to see more of these depletion kits appear and get more and more successful. So this is from a recent study of ours: these are about six different metatranscriptomic samples that we've been reprocessing, to determine what is actually in them. Permafrost is the one I was mentioning; you can see that this gray bar, the low-quality reads, dominates, so that was a really horrible sample. Here we have the cow rumen, we have the kimchi data set, and this red, which there's a lot of, is the ribosomal RNA.
What you're actually after is the light blue, the putative mRNA. So typically in a lot of these studies you're really looking at about 10 to 20% messenger RNA recovery. Yeah. I've been asking this question of my own data: where's the other 20%, are you assuming you're already filtering it? So I didn't generate that figure, so I don't know; it might include the low quality, it might include adapters, it might include host. Another question: how are you able to identify and filter out the ribosomal RNA from the messenger RNA, given the sequences are in small pieces? Yeah, so again, I'll get to that; we use a tool called Infernal, a covariance model-based tool, which is pretty sensitive. You can use BLAST, and BLAST will capture, I think, about 70 or 80% of the ribosomal RNAs, but then you're still missing about 20%, and that's why we switched over to Infernal. And just to mention that host messenger RNAs can prove challenging, depending on the sample. So we did a stool sample recently, to see, out of 16 different preparation methods, which was likely to recover the most bacterial RNA, but out of all the messenger RNA we recovered, about 80 to 90% of it was human, which was quite unusual for a stool sample; it might have been the patient that it came from. But nonetheless, the host messenger RNAs can be quite informative, because they give you an idea of how the host cells that are in, or neighboring, the microbiome are responding, how the microbiome is actually changing host gene expression. So you could think about using these host reads to get a readout on host gene expression. All right, so this was a study that we did last year comparing different types of storage and RNA treatments, and what these plots show are the host reads, which is the dark block, versus the actual putative messenger RNA from bacteria. So it really tended to be dominated by host.
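In practice the rRNA detector (Infernal, or BLAST against rRNA databases) runs first and produces a list of read IDs it flagged; the pipeline then partitions the read set using that list. A minimal sketch of that partitioning step, assuming the flagged-ID set has already been collected from the detector's output:

```python
def split_rrna(read_ids, rrna_hits):
    """Partition read IDs into rRNA vs putative mRNA.

    `rrna_hits` is the set of IDs flagged by an upstream rRNA detector
    (e.g. Infernal hits against rRNA covariance models).
    """
    rrna = [r for r in read_ids if r in rrna_hits]
    putative_mrna = [r for r in read_ids if r not in rrna_hits]
    return rrna, putative_mrna
```

The same pattern works for host-read removal: swap in the set of IDs that mapped to the host genome.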
But in terms of comparing how well the different storage and RNA treatments were able to recover messenger RNAs, we find that freezing has an impact on the samples: across each of these different extraction methods, the frozen sample consistently recovers fewer mRNAs. Fresh is obviously better than storing it, but for longer-term studies you won't always be able to avoid freezing. In terms of preparing your samples, we found the Ribo-Zero kit did better than the MICROBExpress kit at depleting ribosomal RNAs. We also tried RNAlater, and RNAlater does seem to increase your ability to recover mRNAs, but we still have concerns about the biases that RNAlater might introduce in terms of the types of taxa that you are able to recover. But I'm having philosophical issues about writing this up at the moment, mainly because most of the reads are from the host, and in order to get this stool sample, I think we only got verbal ethics approval to study it; we were really interested in analyzing the bacteria. So it's not clear to me that we can actually deposit the host RNA material, because of these kinds of issues; I'm still trying to wrap my head around that. All right, so, generating the reads. We've got our samples, we've extracted the RNA, it's looking super great: how many reads do we need to generate? In the study where we looked at these four different metatranscriptomic data sets, we did a rarefaction analysis: how many reads do we need to sample to recover so many, in this case, enzymes? The idea is that within our microbiome we have a limited number of enzymes that are actually being expressed, so as we sample more and increase the number of reads, we can see the curve leveling off, at around about five million mRNA reads.
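The rarefaction analysis just described can be sketched as follows: subsample the annotated reads at increasing depths and count how many distinct features come back at each depth. The enzyme labels here are purely illustrative toy data, not the real data set.

```python
import random

def rarefaction_curve(annotations, depths, seed=0):
    """For each depth, randomly subsample that many annotated reads and
    count how many distinct features (e.g. enzymes) were recovered."""
    rng = random.Random(seed)
    return [len(set(rng.sample(annotations, depth))) for depth in depths]

# Toy example: 3 enzymes with skewed abundances across 100 annotated reads.
annotations = ["enzymeA"] * 70 + ["enzymeB"] * 25 + ["enzymeC"] * 5
curve = rarefaction_curve(annotations, depths=[10, 50, 100])
```

When the curve stops rising with depth, additional sequencing is mostly resampling enzymes you have already seen, which is the saturation point the lecture is estimating.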
And so if we think that within our sample, within our sequencing run, around about 25% of our reads are going to be from mRNA, then we need four times five million, so we need about 20 million reads. Okay, are we following the maths there? Okay, good. So we would suggest generating around about 20 million reads per sample, and that should give you anything from about 90 to 95% of the enzymes that we would expect to find expressed within that sample. Okay, any questions? All right, so now the fun part, and why we're here: analyzing the data. Again, it's a relatively new field; the software standards and SOPs really haven't been put in place yet. It's also a really rapidly developing field, so there are new tools being developed all the time. And so the way that we would view it, for anyone who's developing a metatranscriptomic analysis pipeline, is that you have a kind of framework, and then you can swap in different tools as they improve your accuracy in recovering information and filtering out unwanted reads, okay? So when you're thinking about developing this pipeline, it's really a set of wrapper scripts: Perl scripts, Python scripts, or whatever (we're real old school, so we use Perl). Most of the bioinformatics is really not about the tools themselves; it's reformatting the data so the output of one tool feeds into the next tool. So a lot of the scripts that you'll be running during the tutorial are really focused just on putting things in the right format. So when we look at the actual tools, the most established ones are the preprocessing methods, where we're actually filtering out low quality, for example, or adapters. These are fairly well established now, because they've been applied to metagenomics, to 16S surveys, even to genomic studies. We also have assembly methods, which are pretty well established now.
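The depth arithmetic above (five million mRNA reads at a 25% mRNA fraction implies 20 million total) is worth capturing in a helper, since the mRNA fraction varies a lot between samples:

```python
def required_total_reads(target_mrna_reads, mrna_fraction):
    """Scale total sequencing depth when only a fraction of reads are mRNA.

    E.g. to land ~5 million mRNA reads when only ~25% of reads are mRNA,
    you need 5M / 0.25 = 20 million total reads.
    """
    if not 0 < mrna_fraction <= 1:
        raise ValueError("mrna_fraction must be in (0, 1]")
    return int(round(target_mrna_reads / mrna_fraction))

depth = required_total_reads(5_000_000, 0.25)  # 20 million
```

With a worse sample, say only 10% mRNA, the same target would require 50 million total reads, which is why the depletion step matters so much for cost.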
So we use one called Trinity, and it does a pretty good job in terms of assembly. We're only really starting to develop annotation methods. If we're generating these metatranscriptomic data sets, generally in one sequencing run we get about 400 to 500 million reads. Once you've filtered out all the chaff, you've somehow got to annotate those reads with some kind of function, and this can be really, really time-consuming. This is where you need cluster computing, to give you the computational capacity to annotate these reads and actually assign some kind of function, some kind of taxonomic information, to each of them. And then we get to analysis methods, and again, there have been so few studies published from metatranscriptomics that these really haven't been established at all. So again, just to emphasize: if you're interested in doing metatranscriptomics and you're developing these pipelines, it's really about writing the wrapper scripts and then plugging in the different tools you need for each step of the filtering, to identify what you actually want to get out of these data sets. Okay, so in terms of preprocessing and filtering, there are a number of tools out there to do this. We've been using one called Trimmomatic. There are others out there, and there will be new ones appearing all the time, but this one seems to be reasonably quick, which it needs to be. How it works is that it has a sliding window that it moves across the sequence, looking for regions of low quality, and it trims off anything that falls below the quality threshold. And then there's another parameter you can set so that reads that fall below a certain length just get dropped and discarded.
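The sliding-window idea just described can be sketched in a few lines. This is a simplified illustration of the concept, not Trimmomatic's actual implementation; the default thresholds here (window of 4, mean quality 15, minimum length 36) are chosen to mirror commonly used settings, but treat them as assumptions.

```python
def sliding_window_trim(qualities, window=4, min_avg_quality=15, min_length=36):
    """Scan per-base qualities left to right; clip the read at the start of
    the first window whose average quality falls below the threshold, then
    discard the read entirely (return None) if what remains is too short."""
    cut = len(qualities)
    for i in range(len(qualities) - window + 1):
        if sum(qualities[i:i + window]) / window < min_avg_quality:
            cut = i
            break
    trimmed = qualities[:cut]
    return trimmed if len(trimmed) >= min_length else None

# Usage: a read whose 3' end degrades gets clipped where quality collapses;
# a uniformly bad read is discarded outright.
good_tail_bad = sliding_window_trim([30] * 40 + [2] * 10)
all_bad = sliding_window_trim([2] * 50)
```

A real trimmer works on quality-encoded FASTQ records and also handles adapter clipping, but the windowed scan is the core of the `SLIDINGWINDOW` idea.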
But there are other tools out there that perform a similar job, and we don't really see this as a major issue, at least for now, within our metatranscriptomics pipeline. So that's good. Assembly — why do we need to assemble our metatranscriptomic data? The idea is that it really improves our annotation accuracy. So as we go across here — I don't know how well this projects — around here is, I think, about a hundred bases, and as you go up to about 200 bases, your ability to annotate really increases, up to around about 80 to 90 percent. Whereas if you're below around a hundred bases, your ability to annotate — and this is using tools such as BLAST — really decreases. So if you can get these longer reads, and you can do that through these assembly steps, then you dramatically increase your ability to annotate and assign some kind of function to these reads. This is just a simple graph showing Trinity as the bottom line, and this is the proportion of the reads that we can actually annotate — I think Oases was one of the others and MetaVelvet was the other. I find Trinity works pretty well. In fact, when we were looking at a deep-sea metatranscriptome, we applied Trinity, and I was very excited by the fact that despite all of the different reads and different taxa in there, we were able to actually assemble and recover the entire PhiX genome. So that's a spike-in control, but given all of these other sequences in there that could create problems with the assembly, Trinity actually did a really good job of recreating the entire PhiX genome. So I was pretty impressed that it has this ability to sort out the signal and not get swamped by all these other sequences. So there are problems with chimeras.
This is where you get assembled contigs in which two sequences that may be from different organisms, but share sequence similarity, get merged and assembled together. This can be a problem. We found that the incidence of these chimeras is relatively low, around about one to two percent, so again, we're not too concerned about the presence of chimeras within these data sets. Okay, so we've got rid of all the chaff, all of the low-quality sequence, and we have assembled our data. How do we now annotate our data set? This is probably the most challenging part. So we typically rely on sequence similarity search tools such as BWA and BLAST. BWA is super fast — you can screen hundreds of millions of reads with BWA relatively quickly. However, it really relies on near-perfect matches. And the issue when we're dealing with environmental samples is that sequence diversity is absolutely huge. Just to give you an example, there's a study from about 2008 or so where a group was sequencing a number of genomes from Streptococcus agalactiae, and every time they sequenced a new strain of this bacterium, they got a whole new set of genes. Even different strains of E. coli can vary by as many as about 2,000 genes. So there's huge diversity within species. If we think we've only got about 9,000 reference bacterial genomes out there, and we're trying to compare against the millions of different species that are out there, our ability to actually match using these tools is really significantly compromised. And so we find that BWA and BLAST are fast and accurate, but they really don't help very much unless you have reference genomes associated with your metatranscriptome. So this suggests that if you're doing a metatranscriptomic analysis, you might want to complement it with a metagenomic analysis, so you've then got a reference genome against which to do this kind of mapping.
So BWA is from the same school, I believe, as Bowtie. The database that we currently use is a set of microbial genomes that we download from NCBI, and then we use BWA to match against those. But again, you can start expanding on these kinds of databases. Okay, so in practice — going back to our five different metatranscriptomes here — beyond BWA, we can think about using tools such as BLAT, which tolerate slightly imperfect matches. And we find that rather than relying on nucleotide searches, we do a much better job if we work in peptide space. Because of third-base wobble, you can get a lot of diversity at the nucleotide level that isn't propagated to the protein level. And so if we do these BLASTX-style searches, where you do a six-frame translation of your nucleotide sequence against a protein data set, then our ability to annotate really increases. However, one problem with BLAST is that it's very time-consuming, and again, you really need cloud or cluster computing. This is starting to be overcome with tools such as DIAMOND, and Morgan introduced me to one called VSEARCH, which is a free version of USEARCH that might offer speedups. DIAMOND in particular we've found quite useful, but there are issues with these tools in terms of quality. The results that we get out of DIAMOND are somewhat comparable to BLAST most of the time — about 90% of the time — but they don't give exactly the same results as you might get from BLAST. So in our current pipeline, we use a BWA search, then a BLAT search, then a BLAST search, and we see that the performance of each of these different search strategies really varies with the microbiome we're talking about. So kimchi is doing really well — the purple line here for kimchi is BLAT, so this is where you're looking for the slightly imperfect matches.
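To make the BLASTX idea concrete, here is a minimal sketch of six-frame translation — the read is translated in all three forward frames and all three frames of the reverse complement before searching in peptide space. The codon table is deliberately truncated to keep the example short; a real implementation would carry all 64 codons:

```python
# Sketch of the six-frame translation behind BLASTX-style searching.
# Truncated codon table: unknown codons translate to 'X'.
CODON = {
    "TTT": "F", "TTC": "F", "ATG": "M", "GCT": "A", "GCC": "A",
    "GGT": "G", "GGC": "G", "TAA": "*", "TAG": "*", "TGA": "*",
}

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate one frame, codon by codon."""
    return "".join(CODON.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """All six translation frames: 3 forward, 3 reverse-complement."""
    frames = []
    for strand in (seq, revcomp(seq)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames

print(six_frames("ATGGCTGGT"))   # first frame (forward) reads 'MAG'
```

Each of the six peptides would then be searched against the protein database, which is exactly why peptide-space searches cost more compute than nucleotide ones.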
And the reason why kimchi is doing so well is that we actually had reference genomes for the kimchi sample, and so we were able to annotate a lot of the reads from this kimchi metatranscriptome really very well. However, for these other samples — our mouse samples, for example — you can see that with BWA and BLAT, only as much as about 30% of the genes could actually be annotated. So this is a little bit disappointing. And then we actually look at the quality of the matches. These are relatively short reads, so this might be a typical match for a 71-base-pair read: you have an E-value of 39, okay? For those of you who run BLAST, you're normally expecting something like E to the minus 39 — this was actually 39, okay? So statistically, it doesn't look like a great match. But if you look at the species that you actually recover, it looks about right. So rather than relying on the E-values, our idea is to use these kinds of cutoffs. As you go across the read length, this is the proportion of matches where you get a certain degree of sequence identity — 80% to 100% sequence match across the read length — and we have a cutoff that captures this region here, okay? So our cutoff for calling a decent match is, I think, 85% sequence similarity across 65% of the read length. Okay, so we've processed our sample, we've identified what's in there, we've annotated our reads. What we need to do now is normalize the expression of these reads. This is a typical step in RNA-seq, and we need to take into account the fact that different genes may have different lengths: if you have a really long gene, you're more likely to sample reads from it just because it's a bigger target, so we need to account for that. So there's this method which results in these RPKM values — reads per kilobase of transcript per million mapped reads — where you're converting from raw read numbers.
So, for example, with two genes of different lengths, converting the number of reads associated with each of them into an RPKM value gives you a fairer picture of what the expression actually is. This one here has the same number of reads as this gene here, but because it's much shorter, the likelihood of reads coming from it is much lower, so it ends up with a higher normalized value, okay? It's a way of normalizing for, and accounting for, the size of each of these transcripts. There are several software tools available to do this, such as Bowtie and Cufflinks. So we've done some functional annotation; what about taxonomic annotation? We might actually want to know where these reads are coming from — which organisms are responsible for expressing these particular genes. And again, this is just to emphasize that the actual species you have can vary quite a lot from sample to sample, and assigning these mRNA reads to specific taxa might give you some clues as to which taxa are responsible for providing critical functions within your microbiome. Another area that we've been thinking about is doing this taxonomic annotation prior to assembly, so that we could separate all of our reads into bins from the different bacterial species and then assemble, so that we don't have these problems with chimeras creeping in. So the issue is: how do we go about assigning taxonomic information? There are alignment methods, such as BLAST and BWA, but again, these can fail where we're lacking suitable reference genomes, and this is a particular problem where we have short-read data sets. So most of the focus has been on these compositional methods — nucleotide frequencies, codon bias and so forth. And the idea is that you can classify all of your reads against the k-mer frequency profiles of genomes.
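The RPKM normalization described above is simple enough to sketch directly. The gene lengths, counts, and sequencing depth here are invented, purely to show how a shorter gene with the same raw count gets a higher normalized value:

```python
# Sketch of RPKM: reads per kilobase of transcript per million mapped reads.

def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Normalize a raw read count by gene length (kb) and library size (millions)."""
    return read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)

total = 20_000_000                       # total mapped reads in the sample
print(rpkm(500, 3000, total))            # long gene:  ~8.3
print(rpkm(500, 300, total))             # short gene: ~83.3, i.e. 10x higher
```

Same raw count, ten-fold difference in length, ten-fold difference in RPKM — which is exactly the length bias the normalization is meant to remove.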
So for each of these genomes you derive a profile, and then you can use that profile — with something like a naive Bayes method — to try to assign each sequence to the genome it sits closest to in that sequence space. Okay, so the idea is that you come up with some kind of k-mer profile, some kind of signature of what that sequence actually looks like, and then try to find the closest genome that it would actually map to. So there are a number of tools that will do this. NBC I mentioned yesterday. Kraken is a relatively fast method. And there's one called CLARK, which is also k-mer based and relatively fast, but has some problems as well. But the gold standard currently is NBC — it just takes an awfully long time to run. We've been developing — so this is my advert, I guess — a tool called GIST to try to take these compositional metrics into account. So this is a computational pipeline, and the idea is that we're combining — it's an ensemble method — six different methods into one, and then seeing which of these different methods performs best for each of our different genomes. Okay, so we have methods such as naive Bayes, we have a Gaussian mixture model, and we use BWA to see whether we can actually get an alignment-based assignment. So there are different methods involved. And the idea is that for each of these different genomes, we score how well each method does in terms of discriminating that genome against all the other genomes. So we can say, for example, that for one particular genome, you might perform better assigning a sequence to it using an alignment method over some kind of compositional method. Okay, so you're really exploiting the ability of different methods to work better for assigning these kinds of matches across different genomes, if that makes sense.
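Here is a toy version of the compositional idea: build a k-mer frequency profile per reference genome, then assign a read to the genome with the nearest profile. Real tools such as NBC use a naive Bayes model rather than the simple Euclidean distance used here, and the 2-mers, genome names, and sequences are ours for illustration only:

```python
# Toy compositional classifier: nearest k-mer profile wins.
from itertools import product
from math import sqrt

K = 2
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def profile(seq):
    """Normalized k-mer frequency vector for a sequence."""
    counts = {k: 0 for k in KMERS}
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in counts:
            counts[kmer] += 1
    total = max(sum(counts.values()), 1)
    return [counts[k] / total for k in KMERS]

def classify(read, genomes):
    """Return the name of the genome whose profile is closest to the read's."""
    rp = profile(read)
    def dist(name):
        gp = profile(genomes[name])
        return sqrt(sum((a - b) ** 2 for a, b in zip(rp, gp)))
    return min(genomes, key=dist)

genomes = {"AT-rich": "ATATATATTAATTA" * 10, "GC-rich": "GCGCGGCCGCGGCC" * 10}
print(classify("ATTATATA", genomes))   # -> 'AT-rich'
```

With real data you would use larger k and a probabilistic model, but the signature-matching logic is the same.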
So in terms of performance, we've been applying it to a germ-free mouse that was inoculated with a defined microbiome called the altered Schaedler flora, which is thought to comprise about eight different taxa. We compared GIST against NBC, and when we look at the different taxonomic distributions that we get from these different methods, GIST identifies, I think, about nine different taxa — which is pretty close to the altered Schaedler flora's expected eight — whereas NBC gives you 15 or so. So we think that GIST performs relatively well where you don't have reference genomes, in terms of being able to assign your reads to specific taxonomic groups. Sorry — the Schaedler flora? So this is a microbiome that was developed in the 90s to inoculate mice under specific pathogen-free conditions, in order to have a fairly standard microbiome that can be transferred from lab to lab, to minimize the impact of a variable microbiome on the results of whatever experiments they might be doing. ASF is altered Schaedler flora. Okay, it's worth bearing in mind that mRNA expression is not equivalent to the ribosomal RNA abundance that you might get from 16S sequencing. You actually get quite different distributions, and there are a number of reasons that might come about. One is that there might be biases in the ribosomal RNA sequencing; there might be biases in the mRNA sequencing. But there's also the idea that just because an organism is very abundant within your sample doesn't necessarily mean it's particularly active within your sample. So you shouldn't necessarily expect a high degree of correlation between mRNA expression and ribosomal RNA abundance. All right, so visualizing the results. How are we doing for time? Oh, I'm going over, sorry.
So in terms of visualizing results — I don't know how many of you have been reading a lot of these microbiome papers, these metagenomics papers — you see figures such as this, where these might be Gene Ontology categories, or KEGG pathways, or some other kind of functional categories, presented as pie charts. And you think, well, this really isn't very informative, because what does it mean, "transcription"? What are the genes involved in transcription, or lipid transport? What does that actually mean in terms of these kinds of pie charts? So I've always found these pie charts of the different functions being expressed relatively unsatisfying, and not very informative in terms of really getting at what is happening within your sample. So we've been trying to push more of these systems-based methods. The idea here is that genes aren't operating in isolation — they're really parts of highly interconnected pathways or complexes. And so if we want to interpret, for example, the expression of one particular gene, we need to know whether the same thing is happening in functionally related genes as well. Does that give us additional support that this particular process is actually being upregulated in our particular sample? So there are a number of different systems data sets that we could use: interaction maps, for one; pathways are probably the best characterised — we really have a good handle on enzyme relationships and the enzymes involved in the biochemical pathways, the metabolic pathways are pretty good; signaling networks; and so forth. So the idea is: can we place our metatranscriptomic data within the context of these kinds of networks, to get more at the root of what's actually happening in terms of microbiome function at this more mechanistic level? So one approach that we've been using relies on an E. coli protein-protein interaction network.
So protein-protein interaction networks tend to be relatively expensive to generate, so we don't have a comprehensive set of protein-protein interaction maps for all the bacteria out there that we're going to see in our samples. So we're really just using this E. coli map as a surrogate, as a scaffold on which we can layer our metatranscriptomic data. Okay, so you can think of this as a template. And what I'm trying to show you here is that the abundance we're getting out of the metatranscriptomic data sets isn't simply a reflection of sequence conservation. This is an additional consideration that we might want to take into account: if we're trying to force our genes, which we know are coming from lots of different bacteria, onto one single scaffold, there's a homology matching step, and those genes which are highly conserved are going to be easier to match than those which aren't very highly conserved. But we see that there's very little correlation — for the ribosome biogenesis genes, for example — between conservation and what we actually find in the data set. So we're reasonably satisfied that the relative conservation of E. coli genes, compared with the other genes in our metatranscriptomic data set, isn't unduly impacting our ability to look at the relative abundance of these genes. The other nice thing about placing these genes within these kinds of networks is that we can start thinking about more statistical analyses. If we're comparing up- and down-regulation of genes in one sample versus another, one route is just to look at the expression of individual genes. And typically, because you have a large number of genes, your power — your statistical ability to find genes that are differentially expressed with statistical meaning — is really very poor. However, you can boost your statistical signal by grouping things into similar functional categories.
So this is where gene set enrichment analysis comes in. We can look at groups of genes that might be functionally related — or, using these kinds of networks, that are connected with each other within the context of the network. And we do find that genes which are functionally connected, which have these interactions with each other, tend to show similar patterns of expression, and this can significantly increase your statistical support. Visualizing results — I think we're on to MG-RAST. You weren't very enthusiastic about it? Sorry? Okay, why? It just takes a long time — you're waiting, what, three months for your dataset to come back? Yeah, my initial use of it was kind of painful. Okay, all right. So there are these tools out there. There's MG-RAST, and then there's this one called MEGAN. MEGAN is quite interesting — it has these KEGG pathway maps, these metabolic pathway maps; that one is the citrate cycle, the TCA cycle. And I think the nice thing MEGAN does is that it can colour the enzymes inside a pathway and show you which ones are up- or down-regulated. That's kind of nice. However, the issue — and this is again the issue with these kinds of KEGG pathway representations — is that the KEGG database, and these pathway maps that you see, have really been based on the curation of three organisms: E. coli, yeast, and human. So are these maps really representative of the pathways in all these other organisms that we're sampling? We have no idea. So rather than using these kinds of static maps, we're more interested in these network approaches. So we have a kind of global network of enzymes, and where we can identify two enzymes sharing the same substrate, we've now got the possibility of flux through those two enzymes — of that substrate being processed through either of them and leading into different parts of the pathway.
So these kinds of additional possibilities within a network just aren't captured in KEGG, and that's why we're interested in using these more network-based approaches. The other thing that we've been looking at is the use of these kinds of pie charts: each pie chart is associated with a single enzyme, the size of the pie chart grows the more highly expressed that enzyme is, and the colours indicate the contributions of different taxonomic groups to the expression of that particular enzyme. So this enables you to identify which particular taxa appear to be actively expressing certain types of functions within your network — you can start identifying key taxonomic contributions. And then we also have these protein-protein interaction maps from E. coli, and again, we can create these kinds of pretty views — and this will be part of your tutorial, playing around with these views and layouts and getting an understanding of how different groups of genes are functionally connected. So here, these are genes involved in ribosome biogenesis, and then we have the cell division genes that we talked about. These here are involved in, I think, cell wall synthesis, and there's a large contribution from one particular taxon. So it might give you an idea that these are the taxa responsible for producing the majority of this particular cell wall component within your microbiome — which may be triggering something associated with your immune system, for example. So these kinds of methods are now starting to enable you to drill down into identifying the taxonomic contributions to certain key functions, which you can then use — as I keep trying to suggest — to design more robust experiments that are focused on specific sets of genes. Okay, let me briefly mention some statistical considerations. No dedicated software or statistical tool for statistical comparisons of metatranscriptomic datasets has been developed so far. Number of biological replicates.
Again, this is an issue. We'd recommend at least two, preferably at least three, but again, these are expensive experiments, and so we're kind of stuck. We have a colleague at SickKids that we have a lot of arguments with, and he's adamant that you need at least four replicates, and we say, we can't possibly afford that, Andrew — how are we going to do this? He says, well, you're not going to get anything meaningful then. And it's kind of like, well, we're going to do something — we've got the money, so what are you going to do? So fortunately — again, to emphasize — the bar in these kinds of studies is relatively low; think of this really as discovery science. If we are going to try to get some statistical support for genes of interest, we can adapt tools such as DESeq or edgeR for differential expression testing. When we've done this, we've actually only found a handful of genes that really fall out of these statistical analyses, just because you have so many genes in there and you're doing multiple-testing correction. On the other hand, we've had a lot more success with these gene set enrichment analyses — by combining things with the same function, you're really boosting your statistical power. There are two routes into these gene set analyses: one is to take the results of programs such as DESeq, that is, which genes are differentially expressed with some kind of statistical support. But what we've found works pretty well is just looking at fold change between samples. So you look at the fold change for each of your genes, group the ones involved in the same kind of process, and then you can perform a gene set enrichment analysis, and we seem to have had reasonable success doing that.
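The fold-change route into gene set enrichment can be sketched as follows: call genes "up" past a fold-change threshold, then use a hypergeometric test to ask whether a functional set contains more "up" genes than expected by chance. The threshold, gene names, and values are invented for the demo — this shows the general idea, not the speaker's actual implementation:

```python
# Sketch of fold-change-based gene set enrichment with a hypergeometric test.
from math import comb

def hypergeom_pval(k, n_set, n_up, n_total):
    """P(X >= k) 'up' genes land in a set of size n_set, given n_up of n_total."""
    return sum(comb(n_up, i) * comb(n_total - n_up, n_set - i)
               for i in range(k, min(n_set, n_up) + 1)) / comb(n_total, n_set)

def enrichment(fold_changes, gene_set, threshold=2.0):
    """Enrichment p-value for gene_set among genes with fold change >= threshold."""
    up = {g for g, fc in fold_changes.items() if fc >= threshold}
    in_set = [g for g in gene_set if g in fold_changes]
    k = len(up & set(in_set))
    return hypergeom_pval(k, len(in_set), len(up), len(fold_changes))

fc = {"g1": 3.1, "g2": 2.5, "g3": 2.2, "g4": 0.9, "g5": 1.1, "g6": 0.8}
print(enrichment(fc, {"g1", "g2", "g3"}))   # all three 'up' genes in one set -> 0.05
```

Because you test one set rather than thousands of individual genes, the multiple-testing burden drops dramatically, which is where the extra statistical power comes from.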
All right, so just to go over these tools again: DESeq, edgeR, and there's a relatively new one called ALDEx2, developed by Gregory Gloor at the University of Western Ontario, in London. The problem with DESeq and edgeR, he argues, is that these tools made certain assumptions that don't necessarily hold true for this kind of sequencing data, and ALDEx2 overcomes a lot of the assumptions that DESeq and edgeR actually made. Some challenges might be which genes we actually include — do we specify a minimum RPKM cutoff, in a similar vein to metagenomics, where we would just exclude those where you only have one read, for example? And then once we identify these differentially expressed genes, we can analyze them through gene set enrichment analysis, or we could just use fold change. And I don't have a slide saying "any questions", but — yes? Have you tried the Kallisto and Sleuth software? Nope — they're pretty recent; haven't used them. I did RNA-seq in the past and I used edgeR and DESeq and such. Yeah. But the problem with those tools is that they map to the gene, and that causes trouble with isoforms, right? But apparently this is resolved by Kallisto, so you get isoform-level resolution. So it might be a better fit for you. Okay, good to know. So we tried edgeR, DESeq, and ALDEx2 on these mouse samples — I think we had four from the colon and four from the caecum — just to see what was differentially expressed between the two. And I think we got maybe three genes with edgeR, maybe four genes with DESeq that didn't overlap, and we got nothing from ALDEx2. So that's a challenge you're facing: these tools just aren't picking up anything. And again, it's probably just down to the number of biological replicates — though there we had four, but from very different mice as well. All right. Questions? Yes — what about targeted validation?
Is that feasible in this field — what do people generally do? I don't know; that sounds like a great idea. I think, again, it will depend on the samples you're collecting, because if you get certain samples and you're really struggling to extract RNA from them, you're not going to have any left for a validation step afterwards, so then you'd have to go back and target another cohort, potentially. Sputum samples, for example — it's quite hard to extract sufficient RNA to get enough for the sequencing. Another question: during the filtration, when you remove the host reads, do you do that after assembly or before assembly? I think we do it before assembly, because if we have a host genome, the ability to actually assign those reads to that host genome is pretty high, and it's a relatively quick step. So again, anything you can do to get rid of the stuff you're not interested in, as soon as possible, is really beneficial. Another question: when doing mock-community tests on the relative abundances within samples, the results aren't always accurate — have you tried STAMP? We should, yeah — it would be good to test that and see what its performance is like relative to DESeq. What are we applying this to at the moment? So the two main projects that we're pushing: one is a study of kids with IBD from the South Asian community. So these are people who are immigrating to Canada from South Asia, who tend to show a decreased risk of IBD relative to the European Caucasians here — that is, when they immigrate, they have a decreased risk, full stop. But the kids that are born to those parents have the same elevated risk of IBD as the European Caucasian kids. So there's something in the environment that they're picking up — or that's the hypothesis — that is contributing towards the development of IBD. So we're looking at — and again, it's to do with costs — these are 40 families.
So the power is relatively low, but this is one of the few cohorts where we can actually get biopsies — where we think we can sample the more informative sites associated with the inflammation of the gut. And so we're going to be generating metatranscriptome data from two different sites within the guts of these patients and then doing comparisons with metabolomics and 16S and metagenomics and so forth. So it's really just a discovery exercise in what we can actually do with this kind of technology and what we might be able to learn. The other study that we're doing is in chickens — which is a little bit of a hard sell at a children's hospital — but again, it's to do with the inability to get the more informative samples from humans. If we want to understand something about the relationship between diet and additives — in this case, these growth-promoting antibiotics that we hear a lot about, which need to be eliminated from livestock within the next two or so years, so we need to find alternatives — then by working with chickens, we have the ability to use a fairly standardized set of diets that we don't have with humans. So you're minimizing some of the variables that we have with these studies. More importantly, with these chickens, we can actually sample different sites of the intestine at different time points, using different batches of chickens, in a way that you just can't with humans. So that's where we're currently headed. All right, so we are now on a 10-minute coffee break.