This morning it's the very exciting theme of metatranscriptomics. And again, hands up, those of you who are interested in doing some metatranscriptomics? And those of you who have zero interest in metatranscriptomics? Yes, you should all be interested, and part of my goal is to convince you that we should be thinking about doing metatranscriptomics on our datasets. And Ryan, who's just walking in, will be taking you through the tutorial at the end of this talk. Right, so module six, metatranscriptomics: these materials are Creative Commons, open source, free to share and use as everyone sees fit. Okay, so what are we hoping to learn today? I'm not wearing the mic, so I'm going to raise my voice; I'm fine without the mic. At the end of this module, one of the main things I want to get across is a little bit more appreciation for metatranscriptomics: understanding its capabilities, what it can do that, say, 16S and metagenomics can't do. I want to go over some ideas behind sample collection, experimental design, and so forth. And then we're going to go through our pipeline, which is called MetaPro. It finally got published last month, hooray for us. Obviously we promote our pipeline, but you should also feel free to think about creating your own pipelines, because each of us may have slightly different needs. The tutorial itself will take you through processing a simple metatranscriptomic dataset. And then at the end we have a visualization tutorial, which is based on Cytoscape. Who's used Cytoscape before? Okay, that's great, there are only three of you, so it should be new to almost everybody. We find it a generally really useful tool for doing visualizations of all sorts of different datasets. So if you think about a heat map: rather than presenting your data as a heat map, you can actually present it as a network. It gives you an alternative way of presenting your data, and it may help you interpret your data a little differently as well. So beyond just metatranscriptomics, we think putting you through this kind of exercise with Cytoscape might be generally useful in your own research if you apply it in other contexts. It's a tool that's been developed by a colleague at U of T, Gary Bader, together with Trey Ideker, who I think is at UCSD. Gary has been working on Cytoscape since about 2003 or 2004; it's very well supported, with lots of plugins and different things you can do with it, but we'll go through that in the tutorial. Okay. Metatranscriptomics: why metatranscriptomics? So 16S, we know, is good for telling us who's there, but it gives us only relatively limited mechanistic insights. Then there's metagenomics, which was all of yesterday. We're very excited about metatranscriptomics: its costs are starting to come down a little bit, and the ways we use metatranscriptomic data are starting to become a little more standardized as well. It's really good at identifying function and determining differences in function across samples. As I say, what I want to get across today is that with metatranscriptomics it's not just which functions are present in a sample, it's which ones are actually active. So it's telling you something about the active function of a microbiome, or rather, who is doing what.
The idea behind metatranscriptomics is similar to metagenomics: there you're doing whole-shotgun DNA, here we're doing whole-shotgun RNA. This can tell us which genes and which pathways are actually being actively expressed in the community. So here is a visualization: these are genes that are involved in cell wall biogenesis. The sizes of the pie charts represent the relative expression of each of these cell wall biogenesis genes, and the breakdowns show the taxa that are contributing to those functions. These kinds of visualizations really give you an idea of who is actually doing what within your community. Once you're looking at relative expression, you can look at changes in relative expression across datasets, and the red arrows here indicate the genes that have been upregulated. In this case this is a cecal sample from a chicken, and the comparison is against a chicken given antibiotic growth promoters, the antibiotics that you put in livestock feed. If you give chickens these antibiotics in their feed, lo and behold, expression of a lot of these cell wall biogenesis genes in the microbiome actually goes up, which is potentially not that surprising, but it's nice to see this kind of demonstration that a microbiome responds to an external perturbation, not necessarily in composition but in terms of its activity. Right, so to give you an idea of what we can learn using these kinds of datasets: this is a study that we published back in 2017 looking at the perilipin-2 gene. Perilipin-2 (Plin2) is a gene involved in lipid uptake in the gut. We have colleagues, Dan Frank and others at the University of Colorado, who have a mouse model, and they found that deletion of Plin2 in these mice largely abrogates the negative consequences of a high-fat diet. So we were interested in understanding what impact this Plin2 knockout has on the microbiome. We did a relatively simple experiment: four sets of mice, two diets, high-fat and low-fat, and two genotypes, wild-type mice and Plin2 knockout mice. We then applied this whole-microbiome RNA-seq, this metatranscriptomics, generating 20 to 30 million reads per mouse. First we looked at composition: Plin2 knockout on a high-fat diet versus wild type on a high-fat diet, and we saw no difference in community composition. Under a high-fat diet, both types of mice seem to have similar community composition. But looking at gene expression, we found that despite having these almost identical communities, there was a significant difference in the genes being expressed: we identified about a thousand highly expressed microbial genes that were differentially expressed between the two genotypes. Okay. So this genotype is altering the function of the microbiome: the microbiome is the same, but the functions, the pathways being expressed, are actually changing. Many of these differentially expressed genes were associated with amino acid metabolism and energy metabolism, and we were particularly interested in one particular pathway.
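Before we look at that pathway in detail: as a rough illustration of what "differentially expressed between genotypes" means computationally, here is a toy sketch with simulated counts. This is not the analysis from the 2017 paper; it just shows the shape of the problem: compare normalized per-gene expression between two groups with a rank-based test and correct for multiple testing.

```python
# Toy sketch of a per-gene two-group comparison (simulated data, not the
# paper's analysis): rank-based test per gene plus Benjamini-Hochberg FDR.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_genes = 200
wt = rng.lognormal(mean=2.0, sigma=0.5, size=(n_genes, 6))  # wild type, 6 mice
ko = rng.lognormal(mean=2.0, sigma=0.5, size=(n_genes, 6))  # knockout, 6 mice
ko[:20] *= 3.0  # simulate 20 genes upregulated in the knockout

pvals = [mannwhitneyu(wt[i], ko[i]).pvalue for i in range(n_genes)]
log2fc = np.log2(ko.mean(axis=1) / wt.mean(axis=1))
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{reject.sum()} genes differentially expressed at FDR < 0.05")
```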
And what these dots represent here are enzymes, and the size of each dot represents its average expression. Oh, okay. Oh, I'm not allowed to move, I see. Okay. So when we map these gene expression differences onto this pathway, what we see is this large track going from, what is it, fructose-6-phosphate all the way down to pyruvate; this is where a lot of ATP is being produced in this particular pathway. The Plin2 knockout mice are actually downregulating this part of the pathway; they don't seem to need as much ATP production. And this helps us think about building some kind of model. In wild-type mice, you have fats in the diet, and these fats are absorbed by the intestinal cells; with regular, wild-type Plin2, these lipids get taken up, so not much is necessarily left over for the rest of the microbiome in the lumen of the gut. However, in the Plin2 knockout mice you're not getting the same uptake of fats, and as a consequence you have more of these triglycerides in the lumen of the gut. Because of this extra availability of triglycerides, there's more than sufficient energy for the microbial community, so they can dial back their energy metabolism, because they're already producing a lot of energy, and they can switch their expression into other types of pathways. So this is an example, and hopefully a convincing one, where we've used metatranscriptomics to start getting at mechanism: why a change, in this case a different genotype in the mouse, a knockout, actually results in changes in the expression of certain pathway functions. Okay, so hopefully that's a little bit convincing for you. Let's look at the uptake of metatranscriptomics. On the left here we have publications with the word microbiome; on the right, publications with the word metatranscriptomics or metatranscriptome. If you look at the y-axis, we can see it's still a very slow uptake in terms of how many publications, how many people, are actually using metatranscriptomics on their datasets. So maybe it's starting to catch on. I think the biggest drawbacks of metatranscriptomics are, one, it can be a little bit challenging because you're dealing with RNA and not DNA, and two, it is more expensive. Library preparation costs are probably the main expense, about $250 per sample at the moment. Now, we're hoping these costs can come down, and the Sanger Centre are doing very high throughput, tens of thousands of samples, for metatranscriptomes, so they're really motivated to bring these costs down. Hopefully, if they can come up with new ways to do these library preparations, that will bring the cost down for everyone else. So I guess we're at a stage now where it's a balance between how much we want to spend and how much we can learn, versus the actual cost of these kinds of experiments. Okay, so here are some examples of metatranscriptomics. On the left is a paper that was published last year in the ISME Journal, where they were interested in looking at soil microbiomes to isolate genes that would be useful for the processing of arsenic.
Rather than relying on metagenomics, they didn't just care whether an arsenic gene was present or not; they wanted it to be actively expressed, and they used that as a criterion for saying, this is the kind of gene we want to focus on, because it seems to be expressed, it seems to be used, within the context of this microbiome. On the right-hand side, this is from Curtis Huttenhower's group, from 2018. They applied metatranscriptomics to IBD samples and found that specific taxa contributed unique pathway expression; they're starting to really dissect the individual contributions of different taxa. There are taxa that are very abundant from the metagenomics point of view but are actually quite quiet; they don't seem to be very active. So there's this concern with metagenomics that we may be able to sample all the DNA in the sample, but is that really reflective of the activity? Is this DNA that's just been shed? How reflective is metagenomic analysis of what is functionally happening in that particular dataset? The other thing metatranscriptomics enables us to access that metagenomics doesn't is RNA viruses. This is a study that we did recently on mice, where we recovered these, what are they, astroviruses. It turns out that our pipeline does a really good job, out of these hundreds of millions of sequence reads, of actually reassembling entire viral genomes. So it seems to be a really good way of identifying novel RNA viruses. And a colleague from U of T, Artem Babaian, who's a new faculty member, published a paper just last month where he used all these metatranscriptomic datasets to expand the known universe of RNA viruses, and obviously there are big implications for future pandemics and so forth. So that's a novel aspect of metatranscriptomics, giving us access to information that metagenomics just isn't able to provide. Okay, so how does this work? It's pretty much the same as an RNA-seq experiment: you extract the RNA, fragment it, sequence it, align to known transcripts, and you end up with a digital readout of gene expression. So it's very similar to single-organism RNA-seq, but it does have its own challenges. In a typical RNA-seq experiment you apply RNA-seq to maybe a single eukaryotic organism, and with eukaryotes you're able to isolate the mRNA through the poly(A) tail. Unfortunately, bacteria don't have poly(A) tails. This means you have to sequence all of the RNA, and all of the RNA includes ribosomal RNA, which makes up about 95% or so of the RNA in a typical bacterial cell. So you've got to get over the problem of how to enrich for the messenger RNA. Another factor is that in RNA-seq you generally have a reference genome, so you're able to map your sequence reads to it. When applying metatranscriptomics we generally don't have that reference genome, so doing this mapping, identifying the source of these transcripts, can be quite challenging. So, the panel of challenges we're facing: first of all, compared to metagenomics, DNA is relatively stable and RNA is not, so when you're isolating the RNA you have to be happy working in very, very clean conditions, making sure all of your surfaces are wiped down with agents that will inactivate RNases.
We also have this lack of poly(A) tails. There's also host contamination, which was briefly touched on yesterday: the host is going to have a lot of RNA as well, and depending on the sample you're working with you can get a lot of host contamination. From humans, that could be problematic from an REB perspective, in terms of screening out those human reads before you do anything further. But at the same time, there's the possibility of using the host RNA signal to get an idea of what the host is actually doing, how it's responding to the microbiome. So it can provide some benefits as well. These are very, very complex datasets, featuring hundreds or thousands of different taxa, so there's the question of what depth of sequencing we need in order to best sample all the different taxa within our particular sample. And, as I mentioned, we have this lack of reference sequences; we're generating sequence from strains we've never encountered before, so how problematic is that when we're mapping these reads back to their original transcripts? The first challenge we face is the instability of RNA: RNA quality can deteriorate very rapidly, so the ideal is to get your sample to minus 80 as quickly as possible. We're involved in a study in Pakistan, through the Aga Khan University, where they're working with field sites and are able to get a stool sample to minus 80 within about two and a half hours of collection. So they have a really phenomenal way of managing the collection and maintaining the integrity of the samples. In Toronto, unfortunately, we don't have that same level of capability. We're working with newcomer communities, in this case young women who are pregnant. A lot of these newcomers are living in shared housing and don't have access to a fridge, so we had to identify a solution where the RNA could actually be preserved at room temperature so that they could send it through the mail to us. A number of kits have been developed: Zymo Research has this DNA/RNA Shield, Norgen has a kit, and the one we've selected is this OMNIgene GUT kit, which was just released last year and is supposed to be very good at maintaining the integrity of RNA in samples, I think for up to about a month or so; at that stage hopefully we have them at minus 80. One thing people have suggested is RNAlater. We find that RNAlater seems to interfere with the library preparation kits, so we don't advocate the use of RNAlater for storing and maintaining these samples. So here's how some of these kits perform: Zymo Research, Norgen. This is interesting: the coloured bars at the top are showing, for example with the Shield or with the kit, that the breakdown of taxa you recover is as good as in the original sample, but what they're actually showing there is DNA, not RNA. It's only the DNA Genotek panel on the right-hand side that shows the RNA breakdown in terms of taxonomic composition, showing that the DNA Genotek kit seems to do a pretty reasonable job of maintaining quality.
You might notice from this Zymo Research panel, for RNAlater here, this thing here: no recovery of any RNA. So it seems to do a really poor job, at least under the conditions of their trials. So we think this DNA Genotek kit performs pretty reasonably, and that's the one we're currently using for our studies. One of the reasons I think metatranscriptomics isn't as widely adopted as you might expect is that it is expensive, largely because of the library preparation. Illumina has a kit which does this library preparation for you, and it's about $250 a sample. Then, on top of the sequencing cost, it works out to about $300 to $400 a sample, but the main cost really is the generation of the libraries. If we can come up with homebrew solutions, or better solutions that don't rely on these expensive kits, that could really drop the price. One thing that can potentially cause a lot of issues in terms of cost is how many replicates we need. Again, this is a pretty nascent field, so it's very hand-wavy; we're going to have a bit of a discussion about power calculations this afternoon with Anita. These are challenging analyses to do because we just don't have the data; there aren't enough studies out there to give us the statistics we need for more compelling power calculations. In terms of mice, we're suggesting at least four replicates, and six is probably a good idea, and that's for mice kept under similar conditions where the microbiome isn't changing much. If you're looking at individual people, where there's a lot of variation, then you're probably looking at a minimum of 40 individuals in order to detect some kind of signal. These are very hand-wavy numbers, and again, Anita will be discussing this more this afternoon. The major challenge we face with these RNA datasets is the large amount of ribosomal RNA: as I mentioned, it can make up as much as 95% of all the RNA within a bacterial cell. Kits have been developed which deplete and filter out these highly abundant ribosomal RNA species. These kits do introduce biases: they seem to be better at removing ribosomal RNA from some taxa than from others. Overall, though, they do a fairly decent job. The current kit we use is this Ribo-Zero kit; it's part of the Illumina library preparation, so it does the Ribo-Zero depletion and the library preparation at the same time, all in one. That's fantastic; we should all be buying shares in Illumina. One thing we find with these kits, though, is that they're designed for bacteria. On the right-hand side is a study where we're infecting mice with a protist. The first eight lanes are from a naive mouse that hasn't been infected with the protist; on the right-hand side it has been infected, and that big dark blue bar is ribosomal RNA associated with the protist. So when we're thinking about these kits that we're applying to deplete this RNA...
...they will work for the taxa they've been designed for, but they're not going to work for other taxa. So if you're interested in certain types of organisms that may not be captured by these kits, you are going to run into problems where a lot of ribosomal RNA gets through. And just to mention again: we do get host messenger RNAs, and while this can be challenging, it can also be informative. You don't necessarily have to throw out those reads; you can actually use them for an additional part of your analysis. Any questions on any of this so far? I'm not seeing any hands... okay, a question. Yes, it comes down to your budget. I think most of our experiments run a maximum of about 32 mice, and so it's, okay, what are we going to gain most from these 32 mice, because that's going to cost us about $18,000 or $20,000, right? I think the key is to think about this as hypothesis generation. The statistics aren't necessarily going to give you a definitive answer, but they start to point you at pathways, and I think this is the more powerful approach: individual genes can be very flaky, but when you start putting them into the context of gene sets, and I'll mention this a little bit as we get to the end of the talk, I think that's where the power of these analyses comes in, because you can see entire pathways, entire complexes, or biological processes being upregulated or downregulated. You may not care about the individual specific genes, but you see that entire wave of function going up, and that gives you confidence that what you're seeing really is real. Then you can design, for example, specific DNA primers to just PCR up some of the genes you want to follow up on, and maybe do a larger experiment with a larger number of samples rather than paying the $18,000 again. So you can think of this more as a discovery tool, and then go back and re-perform the experiment with more focused but cheaper methods. Yeah, so you're asking: you're basically sending them the kit, they're putting the sample in the kit and mailing it back? That's a really good question. In our Toronto cohort, due to COVID slowdowns and research coordinators leaving us for richer pastures with private CROs, we have recruited a total of one individual. I think it took one week through the post for that sample to actually arrive, and then we stuck it at minus 80. You could always use FedEx, but with the added shipping costs, yeah, it starts mounting up. Whereas in Pakistan you've got actual people in the field who can collect the samples and take them straight to minus 80, so yes, we are at a bit of a disadvantage. The other thing I want to mention is this concept of absolute abundance and quantifying absolute abundance. How many of you are aware of this issue of relative abundance versus absolute abundance? Yeah, so I think there's more and more appreciation that we need to consider this in our analyses, whether it's 16S, metagenomics, or metatranscriptomics. More and more, reviewers are asking: can you quantify what the actual level of material is in your sample, and how does that vary from sample to sample? There are a number of ways you can do this.
There's flow cytometry, where you can actually count individual bacteria as they pass through, and that can give you an idea of the abundance of bacteria within that sample. Then there are CFU counts, where you just plate things out and look at one taxon that you know will grow under those plate conditions and see how that varies. One that we have used before is this Zymo spike-in kit, and we had pretty good success with it; this was a trial in chickens. With this spike-in there are two taxa in there, taxa that you should never find in a wild sample, so you can easily identify and distinguish them. As you're doing your DNA extraction, you add this spike-in to your stool sample right at the very beginning of the extraction step, and then it gets processed in exactly the same way at each step. Then, during the bioinformatics, you can identify which of your reads came from those two taxa and use that to quantify: given the level of reads we spiked in, whether they're representing 10% of the sample or 50% of the sample, that gives you a way of estimating how many other bacteria are in the sample. And it does a pretty good job. As I mentioned, we did this in chickens: the top row is a breakdown of taxa from a relative abundance point of view, the bottom set from an absolute abundance point of view, so it really is able to quantify differences. We find, for example, that the jejunum has roughly about 20% of the number of bacteria, in terms of density, relative to the cecum, whereas before you might naively assume they're equivalent. So it's nice to see that you do get this lower density of bacteria in the samples where you might expect it. And I think this is important. At the top, this is a readout based on relative abundance, and that dark red taxon at the top might suggest that at 24 days post hatch those chickens have seen a decrease in that taxon. However, when you account for absolute abundance, there's actually no difference. So unless you account for absolute abundance you can get very misleading results: you can identify taxa that appear to be up or down as an artefact, when accounting for absolute abundance shows there's actually no difference, or they might even be moving in the opposite direction. Yes? Possibly; I'm not sure I'd want to go that far, just because of how things are collected at a later stage, and batch effects and so forth, but potentially it might give you a little bit of an ability to do that. I guess we need to start seeing meta-analyses across several different studies before we can start making those kinds of claims. And the other thing we were discussing a couple of weeks ago was: maybe it's not just the density of bacteria, the absolute abundance here; we really don't know what the quantity of material is in the entire intestine, and have you somehow captured that? But that's a whole different question. I don't know if anyone's looking into total quantification of the contents of somebody's gut, but I imagine that would be interesting and probably have some functional ties associated with it.
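To make the spike-in arithmetic concrete, here is a minimal sketch with made-up numbers: the reads attracted by a known number of spiked-in cells calibrate a reads-per-cell factor, which converts every other taxon's read count into an absolute cell estimate.

```python
# Minimal sketch of spike-in based absolute quantification (numbers made up).
spikein_cells = 2.0e7        # known number of spike-in cells added per sample
read_counts = {              # reads assigned to each taxon by the bioinformatics
    "spikein_A": 120_000,
    "spikein_B": 80_000,
    "Bacteroides": 1_500_000,
    "Lactobacillus": 300_000,
}
spikein_reads = read_counts["spikein_A"] + read_counts["spikein_B"]
cells_per_read = spikein_cells / spikein_reads  # calibration factor

for taxon, reads in read_counts.items():
    if not taxon.startswith("spikein"):
        print(f"{taxon}: ~{reads * cells_per_read:.2e} cells")
```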
All right, generating reads: how many reads are enough? This graph on the left-hand side is from a relatively early study where we compared four different metatranscriptomes, looking at how many enzymes you recover at increasing depth. It's essentially a rarefaction curve showing when you reach saturation, in this case of enzymes. We find that around 90 to 95% of all the enzymes, as measured by Enzyme Commission (EC) numbers, can be recovered with about 5 million messenger RNA reads. It does depend on the complexity of the sample, but even for a deep-sea sample, at about 5 million reads we see around 90 to 95%. That said, given that sequencing costs have come down, we generally do in the region of 40 to 80 million reads per sample; and if you're after relatively low-abundance taxa that you think might be contributing important functions, obviously you have to sequence much deeper. Most of these metatranscriptomics experiments really rely on short-read sequencing, not so much on PacBio. We don't care whether the reads are long; these are transcripts, a maximum of maybe 2,000 to 5,000 base pairs. We're really after the digital readout, the relative abundance, so we just want these short tags, these short reads, and so we rely on Illumina sequencing, in particular the NovaSeq platform, just because of the quantity of data it can produce. There's probably some application for PacBio and long reads if you really want a good idea of the diversity of the actual transcripts, but most of what we've been doing relies on producing counts, these digital readouts of the relative expression of these genes.
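To show how a saturation curve like the one on that slide is computed, here is a minimal sketch using simulated annotations; in practice `read_ecs` would hold one EC label per annotated mRNA read. Subsample at increasing depth and count distinct EC numbers.

```python
# Minimal rarefaction sketch: distinct ECs recovered vs. sequencing depth.
import numpy as np

rng = np.random.default_rng(1)
# Simulate 5M annotated mRNA reads drawn from 2,000 ECs with a skewed
# abundance distribution (a few dominant ECs, a long rare tail).
ec_pool = np.arange(2000)
weights = 1.0 / (ec_pool + 1)
weights /= weights.sum()
read_ecs = rng.choice(ec_pool, size=5_000_000, p=weights)

for depth in (10_000, 100_000, 1_000_000, 5_000_000):
    sample = rng.choice(read_ecs, size=depth, replace=False)
    print(f"{depth:>9,} reads -> {len(np.unique(sample)):,} distinct ECs")
```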
Okay, so we've gone through sample collection, storage, processing, and sequencing; now we have the data. This is where we come in. We are generating huge datasets (Laura mentioned a publication yesterday with 300 billion reads), and certainly with metatranscriptomics we're generating billions of reads per experiment, so we're really reliant on compute clusters. With some of our datasets we run into problems; we run out of memory. This is a bit of a challenge: these are very complex datasets, we have to do comparisons against very large databases, and the amount of memory we need per node is hundreds of gigabytes. So we really are reliant on access to these compute clusters, and processing can be slow; we're talking maybe months for some of these datasets, so there's a lot of compute involved. For the dataset we're showing you today, as Laura mentioned, a lot of the steps have actually been pre-run for you, just because it's impossible to run these on our laptops and desktops. I mentioned that our MetaPro pipeline was just published recently; there are a couple of other pipelines that we compared against: HUMAnN 3 is one, and there's one called SAMSA2 as well, which I'll mention in some comparisons over the next couple of slides. The processing is very similar to processing metagenomics reads: you've got your filtering of low-quality reads and adapters, but we also have an additional step where we need to filter out ribosomal RNAs, because even after applying these rRNA depletion kits, we're still going to get ribosomal RNA creeping through. One of the tools we use for this is called Infernal. There's another one called SortMeRNA. SortMeRNA is great if you want quick, but quick could be writing a program that just says "move to the next step"; that would be quick, but it wouldn't be very effective. In the same way, SortMeRNA only catches about 50% of the ribosomal RNA that Infernal does. The problem with Infernal is that it is slow, so this is one of those steps that slows down processing, and it's one of the steps we'll be skipping in the tutorial. Infernal uses hidden Markov models to identify ribosomal RNAs within your sample in a very sensitive way. But as I mentioned, other pipelines use SortMeRNA, and I'm afraid it doesn't do that great a job. So we're going to compare our tool MetaPro with HUMAnN 3 and SAMSA2. Basically, all of these tools are wrapper scripts around established tools, so you could easily go ahead and build your own pipeline and run it step by step. We provide MetaPro in a Docker container, which means it should work on any architecture: you just download the whole thing and it should just work; there shouldn't be much in the way of installation to get it up and running. On these compute clusters there's a programming environment called Singularity. Who's heard of Singularity? Hey, some of you have. Singularity enables you to use these Docker containers and have one of them running on each node of your supercomputer, so you can set up thousands of these jobs running at once. We've really set this up to run in as much parallel as possible to reduce processing times. In terms of comparisons with HUMAnN 3: this was a kimchi dataset that we looked at, and you can see the ribosomal RNA contributes a large number of reads in these particular samples. All three tools do reasonably well; actually, SAMSA2 does a terrible job of removing ribosomal RNA, and I think that's because it is using SortMeRNA. Yes? Right, so with the Ribo-Zero depletion, I think we're able to get down to about 30% of the reads being ribosomal RNA. That means you've got 70% left, so if you're doing 50 million reads, 70% would be 35 million reads, which is pretty good, pretty saturated. But again, it comes down to the kinds of questions you want to address, and the cost of the sequencing is not the limiting step here so much as the library preparation costs. If you're already spending $250 on library preparation, then spending an additional $50 or $100 to get from, say, 20 million up to 60 million reads is probably worth the extra investment. No, I think you just go ahead and aim for about 50 to 80 million reads, because that seems to be the bar at the moment for other studies. It's not that 20 million is useless, but you do start to see things at the higher depths that you're not able to see with just 20 million reads. Sometimes you get a bad sample as well, where even if you've generated, say, 50 million reads, you still only end up with 5 million usable ones, just because of something that happened with that particular sample; maybe the rRNA depletion didn't work quite right. So as a rule of thumb, aim for between about 40 and 80 million reads; again, it comes down to what you're happy with in terms of cost. Okay.
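An aside on the rRNA removal step itself: once a tool like Infernal has flagged which reads are ribosomal, the filtering is simple. A minimal sketch, assuming the flagged read IDs have already been parsed out to a text file (the file names here are placeholders):

```python
# Sketch of rRNA read removal: stream a FASTQ and keep only reads whose
# IDs were NOT flagged as ribosomal (e.g., by Infernal).
from itertools import islice

with open("rrna_read_ids.txt") as fh:          # placeholder file of flagged IDs
    rrna_ids = {line.strip() for line in fh}

with open("sample.fastq") as fin, open("sample.mrna.fastq", "w") as fout:
    while True:
        record = list(islice(fin, 4))          # a FASTQ record is 4 lines
        if not record:
            break
        read_id = record[0][1:].split()[0]     # drop '@', keep the ID token
        if read_id not in rrna_ids:
            fout.writelines(record)
```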
The other thing to note with our pipeline: the interesting reads are the ones we can annotate, the ones that represent real bacterial genes, and we can see that MetaPro quantifies and annotates things where the other two pipelines just say, "I don't know, it's something, but it's not a gene." MetaPro actually annotates about 50% more in terms of the genes it can identify. I should mention that HUMAnN 3 is really built as a metagenomics pipeline, so I've been a little bit unfair doing this comparison, but it is supposed to have the capability of doing metatranscriptomics analysis as well. One drawback I find with HUMAnN 3 is that it's not terribly transparent in terms of getting at intermediate data you might want: this enzyme here could be really interesting in terms of its expression, so how many reads actually mapped to that particular enzyme? Getting that information out of HUMAnN 3 is not that easy. I don't know if Morgan has any comments on HUMAnN 3. Okay. One important step for metatranscriptomics is assembly; we find that assembling really improves annotation accuracy. Again, the whole point of metatranscriptomics is that we want a digital readout of gene expression, and we're reliant on these short reads. But short reads limit our ability to annotate: you've only got 150 base pairs to map to something else, and really you need something in the region of about 200 or so base pairs to have a good chance of matching accurately to something in the database. So we use rnaSPAdes to build these transcripts, and then we use MetaGeneMark to separate them into individual ORFs, which can then be annotated. Then there's the question of chimeras. These can occur because homologous sequences from different species get mixed together in the assembly. However, we did an analysis of this a number of years ago and found that in these datasets only about 2 to 5% of what we assemble actually turns out to be chimeric, so it doesn't seem to be a problem we're too worried about. That said, it depends on the complexity of, and depth of sequencing in, your sample. For a chicken metatranscriptome, for example, assembled reads are in the region of about 10 to 20%, and unassembled reads are probably in the order of 60 to 70%, something like that. So it is the case that we have a lot of singletons that we just aren't able to assemble. Yes, we still try to keep them, absolutely. Yeah. I think the cost is too prohibitive there; I mean, what does a MiSeq run give you, about 40 million reads? And a MiSeq run is quite expensive as well, isn't it, relative to a NovaSeq. Yeah. Okay, so we've done the processing, we've done some assembly, we've identified the genes. Now we have the genes coming from the assembled contigs, which we've split up into individual genes, and we have the singletons as well, each coming from its own unique transcript. How do we annotate all of this? Our pipeline uses a three-tiered approach. The first tier uses BWA to identify strict matches, things that are pretty much identical to something in the reference database; it's very fast. The second tier is BLAT: this is fast, and less strict than BWA. And then, in place of BLAST, we actually use DIAMOND.
DIAMOND is slower than BLAT and BWA, somewhat, yeah, and it's less strict than BLAT for sure, but we use it in place of BLAST just because it is so much faster. By using this tiered approach we're hoping to speed up the entire annotation pipeline. It does come out with the sorts of matches that would make your teeth curl: it's a score of five, not e to the minus five, a score of five. Those of you who run BLAST searches, generally to call a match a homologue you'd use a cutoff of something like an e-value of 1e-5. Here you get a score of five, and it's, well, is that really significant? We rely on the fact that we look at percentage identity over the actual sequence. These are relatively short sequences; that's why the e-values are not super great. But in a way, we would argue it doesn't really matter if we're not getting to the exact gene from the exact taxon we might associate it with. Instead, because we're using this from a functional perspective, if we can get a match to a gene, and that gene is associated with some enzymatic function, it doesn't matter so much which taxon it comes from; we care more about what that function actually is. Doing these matches in peptide space, translating the DNA sequence into peptides, with percentage identity cutoffs and so on, gives us more confidence that we're mapping to something that is functionally related. So at this step we're really focusing on the function rather than the taxa; for taxonomy, in fact, we have a completely different approach, which I'll come to. This step is purely about functional annotation, trying to identify genes and what their functions might be. Now, one of the problems we're facing, and this is why these pipelines are getting so bloated, slowing down, and taking months to run, is the size of these databases. The reference genomes are always increasing, and the memory needed to perform these searches is increasing as well, which is what requires these large-memory compute clusters where each node has access to hundreds of gigabytes of RAM. There are software solutions where we can split databases. One thing we are now starting to experiment more with is custom databases. What we're showing here are results from 100,000 reads from a cecal metatranscriptome, compared against the ChocoPhlAn 2 database or the ChocoPhlAn 3 database from the HUMAnN software pipeline. We find that our custom database of 500 genomes (MAGs and genomes assembled from previously generated cecal microbiomes from chickens) is small, just 1.1 gigabytes in size; compare that to 66 gigabytes for the HUMAnN 3 database. And when we look at the quality of the alignments, we get much higher quality than with the ChocoPhlAn databases as well. So I think there's a gathering appreciation that these custom databases are really important, really driving our ability to analyze these datasets; otherwise we just get lost in the ever-increasing number of genomes that have been generated.
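Before moving on to how those custom databases get built, here is a schematic of the three-tier annotation cascade described a moment ago. The `run_*` functions are stand-ins, not MetaPro's actual wrappers around BWA, BLAT, and DIAMOND; the point is the control flow: only sequences left unannotated by a stricter tier are passed to the next, more permissive one.

```python
# Schematic of a tiered annotation cascade (stand-in functions, not the
# real tool wrappers): strictest/fastest aligner first, unannotated
# sequences fall through to progressively more permissive tiers.
def run_bwa(seqs):      # tier 1: strict nucleotide matches
    return {}           # placeholder: {seq_id: annotation}

def run_blat(seqs):     # tier 2: less strict nucleotide alignment
    return {}

def run_diamond(seqs):  # tier 3: translated, protein-space alignment
    return {s: "hypothetical_hit" for s in seqs}

def annotate(seq_ids):
    annotations = {}
    for tier in (run_bwa, run_blat, run_diamond):
        remaining = [s for s in seq_ids if s not in annotations]
        annotations.update(tier(remaining))
    return annotations

print(annotate(["orf_1", "orf_2", "singleton_3"]))
```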
So how do we go about assembling these custom databases? I was at the Sanger Centre at the beginning of May and met with some of the MGnify team. Morgan, I think, mentioned MGnify yesterday. This seems to be a great resource: they're compiling and generating MAGs for all the genomes that have been published, and they're starting to collect them into niche-specific collections. So if you're interested, for example, in an acid mine drainage metatranscriptome, you could go to the MGnify database and pull out a collection of MAGs associated with metagenome projects from acid mine drainage. Same, for example, for the chicken gut, the pig gut, or the deep sea. This seems to be a really useful approach for limiting the size of the databases we need to search against, to speed up annotation and make sure we're not spending months processing these datasets; we can try to get this down to weeks or even days. So that's a really useful innovation, and I'm glad someone is actually putting those collections together. And then, finally, just to mention the actual processing of the reads: as in a typical RNA-seq experiment, we're converting our reads into expression values. Whereas in the old days you'd run a microarray experiment and look at the brightness of a spot to quantify whether a gene was up or down regulated, here we're looking at the counts of reads associated with each of these transcripts. But we have to account for the fact that these transcripts are different sizes, and so there's this concept of RPKM, reads per kilobase of transcript per million mapped reads: you normalize for the length of each transcript and for sequencing depth, so RPKM = reads mapping to a transcript / (transcript length in kilobases × total mapped reads in millions). In our pipeline we can incorporate tools such as Bowtie and Cufflinks to normalize for transcript length and come up with a more accurate accounting of gene expression. Okay. So that's functional annotation and normalization of the reads. Let me just spend a few minutes on taxonomic annotation. For taxonomic annotation we have alignment tools such as BWA and DIAMOND, but as I mentioned, with relatively short reads it can be quite challenging to identify taxa using those tools alone. So there are a number of compositional methods available; again, Morgan referred to some of these yesterday morning. Kraken 2, for example, is probably a good approach. We don't actually use Kraken 2 at the moment, though we will be putting it into our pipeline shortly. At the moment we combine the results of, I think, Kaiju, Centrifuge, and DIAMOND searches to give us a breakdown of which taxa are within our sample. So, to give you an idea of the performance in terms of taxonomic classification: we have this majority voting rule, as I mentioned, with DIAMOND, Kaiju, and Centrifuge. This is from a dataset of mice, so these are mouse gut taxa, and on the left-hand side is the gold standard. This is where we know what's there: these mice were colonized with altered Schaedler flora under germ-free conditions, and there are, I think, nine different taxa associated with this altered Schaedler flora. So that's all we should find in these mice, and we can use BWA against those genomes specifically to come up with a gold standard. Yes? Five minutes left? Perfect, I've only got half an hour left in this talk, so that's great.
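Here is a minimal sketch of that majority-vote rule; the per-read calls are toy values rather than real DIAMOND, Kaiju, or Centrifuge output.

```python
# Sketch of majority voting across taxonomic classifiers: keep a label
# only when at least two of the three tools agree (None = no call).
from collections import Counter

def majority_vote(calls, min_votes=2):
    votes = Counter(c for c in calls if c is not None)
    if votes:
        taxon, n = votes.most_common(1)[0]
        if n >= min_votes:
            return taxon
    return "unclassified"

read_calls = {  # {read_id: (diamond_call, kaiju_call, centrifuge_call)}
    "read1": ("Bacteroides", "Bacteroides", "Parabacteroides"),
    "read2": ("Lactobacillus", None, "Streptococcus"),
}
for rid, calls in read_calls.items():
    print(rid, "->", majority_vote(calls))
```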
You can see that none of the pipelines really performs as well as the gold standard, and I think this is something we need to be a little bit aware of when we're running these pipelines: they're not necessarily giving us the truth. We do have to be careful about the errors and misannotations that can creep in, particularly for taxa; we shouldn't take for granted that the taxon reported is actually the taxon present. This next slide just shows the difference between 16S data and metatranscriptome data: the top is 16S, the bottom is metatranscriptome. We can identify groups of bacteria that are present but apparently not active; we can also identify other taxa that may not be very abundant but are very active. So again, metatranscriptomics can give us information that metagenomics and 16S sequencing can't: they find roughly the same taxa, but this really tells you what's active within your dataset. This was an interesting study published just last year, using metagenomics and metatranscriptomics to look at the ecology of the vaginal microbiome and its evolution over time. What they found was that expression, from the metatranscriptomics, was actually a better predictor than the metagenomics of the community dynamics, that is, of what the abundances at the next time step were going to be. So again, this emphasizes that it's the activity, the actual expression of the genes and pathways, that is important for helping us understand how these niches, these microbiomes, are actually evolving. Part of the reason they gave is that DNA can be slow to degrade, so it hangs around and doesn't really give you an accurate reflection of what is happening at that point in time. For the last few slides, and I am going to go five to ten minutes over, sorry Sydney, I just want to mention functional annotation: once reads have been assigned to transcripts, these transcripts have hopefully been annotated with functions. Morgan went through some of this yesterday, but I think it's worth going over again: this concept of functional annotation, what we do with it, and where we get that data. In addition to the UniProt database there's the EggNOG database, and this gives you mappings to things like Gene Ontology terms, KEGG enzymes, KEGG modules, and CAZymes as well. We find that Gene Ontology terms can be challenging to summarize; there is a Cytoscape plugin called BiNGO which lets you identify over-enriched Gene Ontology terms. But you see these papers that just give bar charts of Gene Ontology terms, and, well, I have no idea really what that means. I think this gets at the crux of how we interpret these really complex datasets in some intuitive fashion that isn't just staring at a list or a bar graph showing that one function is at some level and another function is at another level. This is where I think we're now starting to mature more and more as a field, in terms of metagenomics and metatranscriptomics: every time we do a new metatranscriptomics run in our lab, we try to identify a new functional category that we can add into our pipeline. Things we've done recently include iron capture: siderophores and iron storage pathways.
There's a nice tool for this called FeGenie. So there's this concept that there are other databases out there which are really good at focusing on subsystems associated with different functions of bacteria. Can we start taking these more specialized databases and putting them into our pipelines, so we get a readout of, for example, not just metabolism but also iron capture and iron transport? Morgan mentioned the CARD database. And the cell wall biogenesis genes that I showed right at the beginning came from protein-protein interaction datasets that have been generated for bacteria. So again, there are a number of these kinds of databases and datasets out there that let us place genes into broader functional categories, to get a more intuitive understanding of which functions are up or down regulated and which genes are involved, without relying solely on very broad Gene Ontology terms. We've done a lot of metabolic reconstruction in our lab; it's something we've been doing for, I think, the past 15 years or so, coming up with different tools. One of our in-house tools is DETECT; it's part of our pipeline, and we think that in MetaPro we do a really good job of annotating enzymes, detecting them with high confidence and high quality. This enables us to come up with, for example, sets of enzymes for each taxon. I'm not going to dwell on this today, but thinking about where we're going with these ideas: from these datasets, and you could apply this to metagenomic datasets and MAGs as well, we can build metabolic reconstructions. For each taxon within your dataset you can generate a metabolic reconstruction, and that reconstruction tells you what metabolic capabilities each of those taxa actually has. This tool gapseq came out last year, and it does a really good job of saying: these are the enzymes that have been annotated, and I'm going to fill in the gaps in these pathways, because these pathways would not be functional without those enzymes present. It uses your initial set of gene predictions to fill in those gaps and come up with a metabolic model of your particular taxon. And this is where I think the next generation of interrogating these datasets is coming from: metabolic modelling. For example, from a chicken dataset we can identify the taxa and build metabolic models for the individual taxa, and then there are metabolic modelling tools; this is one called BacArena, where each of these dots represents an individual bacterium. We use these models to predict the evolution of growth, the evolution of the structure of this community. These models let you ask things like: what happens if I change the feed? What if I add in this organism, this probiotic; how is that going to change the community? You can make predictions about which metabolites are going to be produced, and so forth. So we think that tying these metabolic predictions to your metatranscriptome, to your MAGs, allows you to start moving into more sophisticated analyses where you can actually model what your data is telling you.
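As a toy illustration of the gap-filling idea behind gapseq (this shows the concept only, not gapseq itself): compare the enzymes annotated for a taxon against the enzymes a pathway requires; the difference is the set of gaps a gap-filler would propose adding for the pathway to be functional.

```python
# Toy gap-finding: which glycolysis steps (EC numbers) are missing from
# a taxon's annotated enzyme set? (Illustrates the idea, not gapseq.)
glycolysis = {"2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13", "1.2.1.12",
              "2.7.2.3", "5.4.2.11", "4.2.1.11", "2.7.1.40"}
annotated = {"2.7.1.1", "5.3.1.9", "4.1.2.13", "1.2.1.12",
             "2.7.2.3", "5.4.2.11", "2.7.1.40"}  # ECs called for one taxon

gaps = sorted(glycolysis - annotated)
coverage = len(glycolysis & annotated) / len(glycolysis)
print(f"glycolysis coverage: {coverage:.0%}; proposed gap fills: {gaps}")
```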
And then finally, I just want to mention a couple of slides on visualization of results. Here we use Cytoscape. Again, we find it a generally useful network visualization tool; we use it for all sorts of our analyses, for figures for papers and so forth, and we think it's quite intuitive. So here, for example, this is from a mouse metatranscriptome, where we've mapped the gene expression data from different taxa onto each of these enzymes. I'm just highlighting at the bottom there: this enzyme, which is involved in the interconversion of fructose-1,6-bisphosphate and glyceraldehyde-3-phosphate, seems to be largely mediated by Bacteroides, so Bacteroides seems to be contributing that function. Using these kinds of visualizations you can see which taxa are really doing which parts of the pathway. One other thing: as with RNA-seq, we're interested in differentially expressed genes. This is challenging for metatranscriptome datasets; they're very complex, and the statistics aren't quite appropriate for applying tools like DESeq2 or edgeR, so we're really reliant on new methods being developed for differential expression. At the moment we largely rely on DESeq2, and we know it's prone to false positives. But when we've tried alternative methods like ALDEx2 and ANCOM, we find that we effectively get nothing out. And so it's, well, that was a whole waste of time if we apply ALDEx2 or ANCOM and it tells us that nothing is significant. So we go back to DESeq2, and that's where gene set enrichment analysis then gives us more confidence. One thing these methods need to start taking into account is the need to normalize for taxon or gene abundance. For example, within your dataset, if a taxon doubles in abundance, you might expect the number of its transcripts to also double, but that's purely because the taxon is increasing in abundance, not because it's increasing its gene expression. So there's a need to normalize for this, and there's this concept of taxon-specific scaling. This is very much a work in progress; I think people are still working out the best approaches to making sure we can detect really significant, statistically supported, differentially expressed genes. So finally, as I mentioned, it's a little bit depressing in terms of the statistical measures out there; what can we do about that? I think gene set enrichment analysis is really the way forward: an additional layer of statistical support for which functions are being up or down regulated. The idea is that rather than relying on an individual gene being up or down regulated, you collapse genes associated with one specific function into a group, and then you test whether that group is enriched in differentially expressed genes. That gives you a lot more confidence that a particular function is being up or down regulated.
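To make that gene-set idea concrete, here is a minimal sketch with made-up numbers, using a one-sided Fisher's exact test: is one functional category over-represented among the differentially expressed genes?

```python
# Sketch of a gene-set enrichment test: 2x2 table of DE membership vs.
# category membership, one-sided Fisher's exact test (numbers made up).
from scipy.stats import fisher_exact

total_genes = 5000   # genes quantified in the metatranscriptome
de_genes = 400       # called differentially expressed
in_set = 60          # genes in one functional category
de_in_set = 20       # DE genes that fall in that category

table = [[de_in_set, de_genes - de_in_set],
         [in_set - de_in_set, total_genes - de_genes - in_set + de_in_set]]
odds_ratio, p = fisher_exact(table, alternative="greater")
print(f"enrichment p-value: {p:.3g}")
```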