Hello, and welcome to the Galaxy Community Conference 2021. This is a workshop on metatranscriptomics analysis, which takes microbiome RNA-seq data and analyzes it within the Galaxy framework. This workshop is conducted by researchers from the University of Minnesota, from Erasmus University in the Netherlands, and from the Norwegian University of Life Sciences in Norway. Microbiomes play an important role in ecological balance, whether in environmental research or in health and disease conditions in clinical research. Multiple studies have shown the correlation of microbial composition with physiological conditions. Gut microbiome research, for example, has been an important focus, especially since the gut microbiome affects the health and disease state of the individual. In fact, the gut microbiome outnumbers the human host cells in terms of genomic content and number of cells, and can constitute as much as 5 pounds of your body weight. Apart from its effect on disease, it has been shown that a gut microbiome is unique to the individual, and hence personalized medical care would be required to treat conditions implicated in gut microbiome dysbiosis. In order to study the gut microbiome, researchers have used methods such as metagenomics, wherein DNA from either the gut microbiome or from environmental samples is extracted and subjected to either 16S rRNA studies or whole-genome sequencing studies. The 16S rRNA studies, also called amplicon sequencing, help identify the taxonomic composition of the microbiome, while whole-genome sequencing can identify the genes that are present based on the DNA content of the sequenced genomes, which helps predict some of the functions that this microbiome can express. Studies using metagenomics have helped to correlate taxonomy with the observed phenotype. 
However, over the years, it has been shown that apart from studying the taxonomy of the microbiome, it is important to understand the functional expression of the microbiome with respect to the condition it has been exposed to. For example, in 2012, a study that involved collecting samples from multiple body sites of multiple individuals showed that if one were to study the different genera present in these samples, looking at, for example, the buccal mucosa, the taxonomic composition is variable across samples from individuals around the world. However, if one were to use the metagenomic data that came out of this to predict the metabolic pathways associated with these microorganisms, one would find that the metabolic pathways are relatively consistent across individuals. This indicated that functional pathways, or functional expression, could be a better way of understanding the effect of perturbations on the microbiome, since functional studies establish a basal level to compare against. So, to study the microbiome, as I mentioned earlier, metagenomics, which involves studying the DNA content and the taxonomy of the microbiome, is fairly popular. In fact, there are quite a few software tools that help you correlate taxonomy with physiological condition. One can also predict function based on some of the whole-genome sequencing methods that are used. However, to get a better understanding of function, researchers have started using metatranscriptomics, which involves looking at the RNA expression of the microbiome. This has the advantage of not only deciphering the taxonomy of the microbiome, but also revealing the various functions that are expressed in terms of RNA. In the past few years, researchers have also started studying metaproteomics, which involves identifying proteins or peptides that are expressed by these organisms. 
And here, one has a slightly better understanding of function, because proteins represent enzymes and the various effector functions that a microbiome expresses. We believe that both metatranscriptomics and metaproteomics have the potential to unravel the mechanistic details of microbial interactions with the host or the environment. With this, we'll move on to the hands-on section of metatranscriptomics analysis. As you can see here, metatranscriptomics is a multi-step workflow, which involves multiple software tools and component steps, starting with pre-processing, which will be covered by Saskia Hiltemann, then taxonomic composition, which will also be covered by Saskia Hiltemann, and then Subina Mehta will talk about functional analysis of the microbiome. Now that you have some background on metatranscriptomics, let's talk a little bit about the bioinformatics that makes metatranscriptomics possible. Like other omic technologies that generate large volumes of complex data, metatranscriptomics requires a number of tools to analyze the data and provide results that have biological importance to researchers using the technology. As you can see here, and as you'll hear more about as this workshop goes on, there are a number of different software tools to take in data, process it, and then annotate and analyze it in order to give biological results that a researcher can use to generate new hypotheses and take away biological knowledge from their experiment. So how do you approach this general challenge of needing to do complex bioinformatic data analysis, but in a way that is accessible to, say, non-expert bench scientists and researchers who want to deploy metatranscriptomics in their work? One solution, and one that our group has worked on, is Galaxy. 
Without going into a lot of the basic details of Galaxy, I'm going to give a little bit of the history of why we turned to Galaxy for the work that we do, including metatranscriptomics, and why it's a very useful platform for this type of bioinformatic application. Galaxy, which has now been around for over 15 years, has a number of features that are really advantageous. One of those is being geared towards bench scientists, so that you do not have to have advanced training in programming skills to use it, and there are also many training resources available. Very much from the start, Galaxy has been the home for genomic and transcriptomic data analysis tools. That was really the genesis of Galaxy as a platform: a workbench to bring together lots of tools in these areas that could be integrated together and made more usable by the community. And also, as we discovered, one of the main focus points of our work is actually in proteomics, mass spectrometry-based proteomics. So our group, in addition to being interested in things like metatranscriptomics, has been developing Galaxy extensions to bring in software for proteomics through the Galaxy for Proteomics project, the Galaxy-P project, whose logo you're going to see on a number of our slides. And as you're going to see, what we have done in terms of approaching Galaxy development is guided by the idea that we do not want to reinvent the wheel. We take many high-value standalone software tools and use the features of Galaxy to implement them so that they run through the Galaxy platform, and then integrate them with other tools, including customized tools, which makes these workflows much more available and usable by the community. Another way to look at Galaxy is this: it takes in datasets, which can be metatranscriptomic data or other types of omics data, and it is a home for software tools. 
These can be diverse types of software tools that can be integrated and made interoperable, as well as visualization tools and some of the other things that you will see in this workshop. The Galaxy platform itself will run on diverse types of scalable computing resources, such as the cloud or local high-performance computing infrastructure. All of that is encapsulated within the platform, along with a user interface that's meant to be usable by, again, non-programmers and non-experts, who can learn how to use this interface and run the software and run analyses. There is also a more advanced programmatic API, where you can actually run this through the command line if that's something you are familiar with and would like to do. Generally, it is an environment that brings together all of these features to integrate data and tools and visualizations on scalable computing resources. So that's a little bit about Galaxy. And as you're going to see today in this tutorial, and may already be familiar with, the interface has a few different features. It has a column here, the tools panel, where you can go and find the software that has been implemented within the Galaxy platform that you may be using. This center pane is really the working window that you will see as you start your analyses, as you're entering settings for software and as you view results. And then there is the history, and this is something that's really important, because when you run an analysis, it records and keeps track of all of the steps in that analysis, as well as recording and archiving all of the inputs and the results, intermediate or final, that come out of the analysis. Just another note: one feature that always very much attracted us to Galaxy, and is a big part of the work we do, is the community-based focus of Galaxy. 
And in that, there's a Galaxy Tool Shed. As tools become available in the Galaxy environment, they can be published to this public Tool Shed, and then others who use Galaxy, perhaps on their local installs, can find those tools and install them in their local instances. Another key concept that you will see here today is workflows and histories. As I've already alluded to, when you have a more complex bioinformatic analysis to do, many times it goes from a single software tool, where a single input runs through the tool and gives you an output, to more of a workflow, where you now have several software tools: the input goes into one software tool, and the output that comes from that tool may be used as the input for the next downstream tool. So it's sort of a pathway, if you will, of software that all works together, and that's the concept of a workflow. In Galaxy, once these tools are integrated and interoperable, you can build these integrated workflows which link together the different tools. A nice feature is that these can be saved with all of the various optimized settings for each tool. So it becomes a much more automated process, wherein data can be taken in and the workflow started; each tool runs in succession and gives you a final output. The other piece here is a history, which is a bit beyond a workflow. It uses the tools that might be part of a workflow, but it also saves all of the input and output data that was utilized with that workflow. So it's really a recorded archive of an analysis that you have done. Galaxy records all of these analyses and gives you a chance to explore intermediate results as well as the end results that may come out of an analysis or an analysis workflow. 
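The chaining just described, where the output of one tool becomes the input of the next, can be sketched in miniature in Python. This is only an illustration of the workflow concept, not how Galaxy itself is implemented, and the step names here are hypothetical:

```python
# Minimal sketch of the workflow idea: each "tool" is a function, and the
# output of one tool is passed along as the input of the next.
# These step names are hypothetical illustrations, not real Galaxy tools.

def trim_reads(reads, min_len=5):
    """Pretend filtering step: drop reads shorter than min_len."""
    return [r for r in reads if len(r) >= min_len]

def count_bases(reads):
    """Pretend summary step: total bases in the surviving reads."""
    return sum(len(r) for r in reads)

def run_workflow(data, steps):
    """Run each tool in succession, feeding each output into the next tool."""
    for step in steps:
        data = step(data)
    return data

# The saved "workflow" is just the ordered list of tools, with their
# settings baked in (here as default arguments).
workflow = [trim_reads, count_bases]
print(run_workflow(["ACGTACGT", "ACG", "TTTTTTTT"], workflow))  # 16
```

In Galaxy the same idea is handled graphically: each box in the workflow editor is one of these steps, and the saved workflow remembers the settings of every tool.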
Moving along with this idea of workflows and histories, another nice feature is that one can share these saved analyses (histories) or saved workflows, with all of the different tools and settings already incorporated. That makes them very usable and reproducible by others that you want to share these analyses with. This can be done by sharing a URL, or there is a function where you can save these as a file that can then be uploaded into other Galaxy instances to be used. Going along with this community focus of Galaxy, another really nice feature is that access to Galaxy has also been a big focus of the community. There are publicly available resources like the usegalaxy.* network. There is a Galaxy instance publicly available that you can register for and use at Galaxy Europe (Galaxy EU); there's usegalaxy.org, which is a United States-based Galaxy instance; there's an Australian one, as well as some other public instances available around the world. This is just a snapshot of the proteomics portal that is part of the European Galaxy public instance. So another aspect here is trying not only to train people how to use these tools for their own analyses, but to give resources and access where these analyses can actually be run. Hi everyone, my name is Saskia Hiltemann. I work at the Erasmus Medical Center in Rotterdam in the Netherlands, and today I'll be walking you through the first half of the metatranscriptomics tutorial, and then I will hand over to Subina Mehta for the second half. In this tutorial, we will go over how you can analyze metatranscriptomic data and what kind of information you can extract from it, and we'll show you how you can assign taxonomy and function to the identified sequences. So there are multiple reasons to study the microbiome. One big reason is healthcare research. 
Human bodies have a lot of microorganisms in them, in various places, and these can affect your health or how well drugs work for you. There's so much genetic data from microorganisms that it's sometimes referred to as your second genome. Another big area of study is environmental studies, because microbes are also present in the soil, and this can affect how well plants grow, for example, so it's studied in agriculture a lot. You've probably seen this slide before, but in meta-omics there are several different levels at which we can look at the microbiome. We can do metagenomics: this is often used for taxonomy, and also a little bit for function if you do whole-genome shotgun sequencing, though of course this only shows you the potential that the microbiome has for these functions. If you also want to look at what is actually being expressed, you look at metatranscriptomics or metaproteomics, or a combination of these. So today we'll focus on this level, the metatranscriptomics, and look at both taxonomy and function. This tutorial uses the ASaiM pipeline; here you see a schematic overview of it. It was published in 2018 by Bérénice Batut from Freiburg and colleagues, and more recently a specialized ASaiM-MT (ASaiM for metatranscriptomics) was also published, with workflows specialized for metatranscriptomics data, so that's what we will cover. Now, the first part of this pipeline, like with any analysis, covers quality control: it's always important to clean your data and assess its quality before you begin. Then we have two downstream parts. One is to really look at the taxonomy: who is there, which microorganisms do we have? And secondly, we're also going to look at the function: what are these microorganisms doing? At the end of this we will do some visualization as well. 
Then we will use data from a study, from a dataset, studying cellulose degradation in a biogas reactor. They took samples of this community in the biogas reactor over different time points to see how it changes over time. There's a lot of data in this tutorial; we will only look at one time point, but we'll also show you some of their results looking across multiple time points. Now, before we go any further, I am going to start with the hands-on part, and then we'll come back to explain some of this. So let's start by accessing Galaxy: make sure you have your Galaxy instance open. I'm using Galaxy Europe, but this works on multiple different Galaxy servers, so make sure you are logged in and start a new history. If your history is not empty, just click the plus icon here to start a new history, and then name it something that you can remember, for example "Metatranscriptomics". Okay, so all the steps we will do today are also in a GTN tutorial. To access that, you can click this little hat icon to overlay the training materials over your Galaxy. For you it will probably show the homepage, so I'll just show you how to get to the tutorial from there. This is the GTN homepage; we are going to go to the Metagenomics topic, and within that topic we're going to scroll down, and at the bottom you see Metatranscriptomic analysis. You'll notice that there are two versions of this tutorial: we have here the full tutorial and a shorter tutorial. For the sake of time today we'll do the shorter tutorial, so if you click on the computer icon next to that you will open the tutorial itself. 
Here you can find all the information, and more background information than maybe I cover today in this video, so if you are curious and want to read more about it, I would recommend going through this tutorial. If you want to really go through it step by step, tool by tool, I would do the long version of this tutorial, because the two tutorials are exactly the same, except that in the short tutorial we will use a set of workflows for each section, just so you have to do a little bit less clicking and we can focus more on the outputs and what is happening. Otherwise the two are the same, and you can also switch between the long and the short version at the start of each section: in case you are more interested in certain sections and want a little bit more detail, you can switch to the full one there, and if there are other parts that you are maybe a little bit less interested in right now, you can do them in the fast tutorial. But let's start with getting our data into our history. There are two ways you can do that. If you scroll down here to the data upload part, it'll instruct you how to do this. We need these two files, so we can copy the URLs that we see in the gray box using this copy button, and then if we click outside this tutorial again, we'll go back to our Galaxy server, and then we're going to go on the left side to upload data, paste/fetch data, and then we just paste our URLs in and hit start. Okay, so you see that my upload is finished, so I have the data in. Now all I need is the workflow we're going to run, so we're going to go back to the training materials. The first section is all about quality control. We do several steps, and I will explain what each of these steps does after we start the workflow, but we're going to assess quality using FastQC and MultiQC, then we're going to filter these reads using Cutadapt, and we're going to remove ribosomal RNA using SortMeRNA, and then we're going to interlace these FASTQ files into a 
single file, because one of the next tools needs that. So, to save you a little bit of clicking, we have made this part of the ASaiM workflow also available as a smaller workflow which we can import. How to do that is described in this next hands-on box. Let's start by copying the URL of this workflow: you can right-click and copy the link, and then we will go back to our Galaxy. At the top here you see this menu, Workflows; if you click on that you will see the list of all your workflows, and at the top there is a create button and an import button. We are going to import an existing workflow from a URL, so click on the import button, and here it asks for a URL, so we're just going to paste that in and say import. Okay, and at the top of my list I see "Workflow 1: Preprocessing". Now, if you want to see what that workflow looks like, you can always click on the title here and click edit, and then you will go to the workflow editor and you can see a little bit of what goes on in that workflow. But I will explain it in more detail: every one of these boxes is a tool that will be run, and these two are the input datasets. FastQC runs on both of the input datasets, and that goes into MultiQC, and then, based on what we see there, we will run Cutadapt to do some filtering and trimming on the data, then take out the ribosomal RNA, and then interlace. But we don't want to edit it, we really just want to run it, so we are going to go back to our workflow menu and this time hit this play button to the right of it, and then we can start our workflow. It needs two inputs: a forward FASTQ file and a reverse FASTQ file. We only have two files in our history, so make sure you select the file with "forward" in the name for the forward input and the reverse file for the second input. And that's it, then we just hit run workflow, and now you can see here a little bit of the progress. First it'll schedule your jobs, and then you will see 
a progress bar as it starts running and completing these jobs. While we wait for that, I will go back to the slides and explain a little bit more about what happens in this part of the tutorial, in this workflow. The slides that I use are in the same place, next to the tutorial, so you can click on this logo if you want to see them yourself. So let's catch back up. Okay, the input format that we have is FASTQ files. You've probably seen FASTQ files before; if you have not, I recommend you follow the quality control tutorial before jumping into this one, but just as a refresher, this is what the FASTQ format looks like. Every read is described by four lines: the first line is the name of the read, the second line is the sequence, then you have this plus (or sometimes it's the name again), and then you have the quality scores. Each of these symbols and letters encodes a quality score, so this C has a quality score of 'A', and this T down here has a quality score of ':'. Okay, that all sounds a little bit cryptic, so what does it mean? It depends a little bit on the sequencer exactly which of these characters maps to which quality score, but nowadays almost all of them will be FASTQ Sanger encoding, which Illumina also uses, which means that, for example, the 'A' that we saw before means a quality score of 32, and the ':' is here, so that means a quality score of 25. That's a little bit more intuitive to grasp: 32 is better than 25, for example. But still, what does this mean? These are Phred scores, and a Phred quality score of 10, which would be, in this case, let's say this asterisk, would mean that the sequencer has a confidence of about 90 percent that it made the correct call, so there's a chance of one in ten that this base call is incorrect. And this is a logarithmic scale, so if you go up to a quality score of 20, that means: okay, now I'm 99 percent sure about this call. And if you go up to 
40, then the sequencer says: okay, I'm 99.99 percent sure that this base is really what I said it was. But of course, in your FASTQ file you have thousands, hundreds of thousands, millions of reads, so you don't really want to look at it this way. That's why we use tools like FastQC to make this a little bit more user-friendly, so we can assess the quality. The forward reads are in one file, the reverse reads are in another file, and both of those we will send to FastQC for a quality report. Then we will merge those together into a multi-sample quality report, and based on that we will do some filtering and trimming. You see here I made a dotted arrow, which means that usually, in between here, you have a look at the quality report before you decide on the settings to use for filtering and trimming, and often after that you look at the results of the filtering and trimming, do another quality assessment to see if it has improved enough, and maybe you do this multiple times until you end up with a clean dataset of high quality that you're happy to continue with. Once we are happy, we will go on and filter the ribosomal RNAs from this dataset. Ribosomal RNA is very handy for taxonomy, for identifying which species we have, but not so much for functional annotation of the sequences, so before we do the functional part of the analysis we want to remove these ribosomal RNAs. And then the last step we do in this workflow is that we take these forward and reverse reads, the cleaned versions of them, and we turn them into a single interlaced file that contains both forward and reverse reads. So this is the same picture again with the tools that we use: in this case FastQC for the quality reports, MultiQC for the multi-sample quality reports, Cutadapt for filtering and trimming, SortMeRNA to remove the ribosomal RNA, and FASTQ interlacer to combine forward and reverse FASTQ files into a single FASTQ file. Okay, so once our workflow is finished we will see a live example of this. 
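The Phred encoding and the quality trimming just described can be sketched in Python. This is a minimal illustration of the Sanger/Illumina 1.8+ encoding (ASCII offset 33) and of the idea behind Cutadapt-style quality trimming; it is not the actual algorithm any of these tools uses:

```python
# Phred quality scores in FASTQ (Sanger / Illumina 1.8+ encoding):
# each character in the quality line encodes Q = ASCII value - 33,
# and the probability that the base call is wrong is P = 10 ** (-Q / 10).

def phred(char, offset=33):
    """Decode one quality character into its Phred score."""
    return ord(char) - offset

def error_prob(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)

print(phred("A"))        # 32, the example from the slide
print(phred(":"))        # 25
print(error_prob(10))    # 0.1    -> 90% confidence in the call
print(error_prob(20))    # 0.01   -> 99% confidence
print(error_prob(40))    # 0.0001 -> 99.99% confidence

# A toy 3'-end quality trimmer in the spirit of what Cutadapt's quality
# option does (this is NOT Cutadapt's actual algorithm): cut bases off
# the end while their quality is below the threshold, then discard
# reads that have become too short to be useful downstream.
def trim_3prime(seq, qual, threshold=20, min_len=4):
    while qual and phred(qual[-1]) < threshold:
        seq, qual = seq[:-1], qual[:-1]
    if len(seq) < min_len:
        return None
    return seq, qual

# '#' is Q2 and '+' is Q10, so the last two bases are trimmed:
print(trim_3prime("ACGTACGT", "IIIIII#+"))  # ('ACGTAC', 'IIIIII')
```

The interplay you see here, between the threshold and the minimum length, is exactly the cleaning-versus-information balance discussed below.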
But FastQC, again, you probably saw this in the quality control tutorial, so I won't go into too much detail; that tutorial really has a lot of information, so if you are curious about these reports, please follow that tutorial. There's also more information in the metatranscriptomics tutorial about these specific reports, but this is basically what it looks like: you get a little web page that has some summary metrics at the top and then a bunch of plots showing you something about the quality. This is much nicer to look at than the raw FASTQ files. For example, this plot here shows you the quality per base: on the left here you see that the first base in every read has, on average, a quality of 33, and this goes all the way over to the end of the read. You see that this is quite typical: usually the beginnings of the reads are more accurate, and then towards the end it gets harder and harder for the sequencer to make a correct call, so the quality drops. But you see here it still remains over 30 throughout, and that is still considered quite a good quality score, so we would be happy with this. But if you see this drop down into the red zone, you would probably decide to do some trimming of the ends here, and say: take all my reads and remove bases from the end if they are under a certain threshold, if they are under 20, for example. Then you will have cleaner data, but you will also have shorter reads, so you want to find this balance between cleaning up your data and not throwing away too much of your reads, because you're also throwing away information, and especially if you're going to do something like mapping, this might lead to less accurate mapping downstream. So just try and find that balance. Okay, so we discussed this plot, but it's not the only one; there are many more here. There's a link here to the dedicated QC tutorial, and I would take your time to look through all of these, and if you have any questions, look in the 
tutorial, or let us know, and I will show you some of these after our workflow is done as well. Now, these FastQC outputs are really nice, but FastQC only works on one sample at a time, which is fine if you only have one or two samples, but once you start doing analyses on hundreds of samples at once, you don't really want to look at 100 individual reports like this. So there's a really cool tool called MultiQC. This can combine multiple outputs from other tools: it can take the FastQC output from different samples and combine them into a single report. Here's an example of these reports: it'll give you some summary information about the different samples, there's GC content, length, duplication level, etc., and it will also show you these plots that you will recognize from FastQC, but now with several samples shown in the same plot. You can hover over these to see which line belongs to which sample, so it's a really nice interactive tool. And the great thing is, it doesn't only work for summarizing FastQC output: if you have outputs from other popular tools, MultiQC can also combine those, so it's a very versatile, very useful tool. Okay, so after we assess our quality, we maybe decide: okay, we need to clean this up a little bit, maybe the ends aren't looking so good. Then we use a trimming and filtering tool; in our case we use Cutadapt, but there are a lot of tools out there that do basically the same thing, so you can try several of these to find which ones work best for you. One thing Cutadapt will do is trim low-quality bases from the reads: for example, at the end, if the quality drops too low, it'll just cut them off and make the reads shorter. But if, because of this, the read becomes too short, we probably also want to throw it away, because it won't be useful for downstream analysis, so we'll also set some thresholds on the length of these reads. And it also looks at, for 
example, the mean quality score: if the average over the entire read is too low, we'll just throw the whole read out. And something else it can do, which is very useful: it can remove any adapters or primers that may be left in your data. Often when you get your data back these will already have been removed and you don't have to worry about it, but tools like Cutadapt can also detect these, especially if they're from popular platforms, and just remove them for you. Like I said, Cutadapt is not the only tool that does this: you also have Trim Galore, Trimmomatic, and many more, and lots of these are in Galaxy as well, so please go and explore. And then, like I said, after that, so we've assessed our quality and we've cleaned up the data based on those quality reports, we are ready to assign a taxonomy. But the other thing we want to do is also assign function, and for this we want to remove the ribosomal RNA sequences first, because they're not very informative. There is this tool called SortMeRNA that will take all our reads and align them against a database of ribosomal RNA sequences, and then everything that is a good match with a known ribosomal RNA sequence will be taken out, and for the functional part of this analysis we will only look at the dataset without the ribosomal RNA sequences. And then the final step of this workflow is the FASTQ interlacer. This is a little bit more of a technical thing that we need to do, but most of the time, if you have paired-end data, you will have two files: one file containing the forward reads and one the reverse reads. When we do paired-end sequencing, typically the DNA, or the sequence, is cut into longer fragments like this, and then one side of the fragment is sequenced, say 250 base pairs, and also part of the other end, and these belong together. So we have not only the sequence for these smaller parts, but we also know something about how far apart these two should be, and this gives us extra information for 
the mapping steps. For example, if this forward read could possibly map to multiple places in the genome, then we can use the information about the reverse read to narrow down the real place it came from. We can say: okay, not only does this have to match, but we know that, say, 500 bases downstream this reverse read also has to map, and then you can do a little bit more accurate mapping. So, like I said, most of the time you will receive this in two separate files, but sometimes you get it in one interlaced file, and some tools expect two separate files while other tools expect one interlaced file. One of the tools that we are going to use today really wants the interlaced format, so we are going to convert it using the FASTQ interlacer. How that would look is that it takes these forward and reverse reads and basically just alternates them and puts them in one big file: the first read in this file will be the forward read of the first pair, then comes the reverse read of the first pair, then the forward read of the second pair and the reverse read of the second pair, and so on. And there are also tools to do this the other way around, by the way: if you have an interlaced file like this, but you have a tool that needs separate forward and reverse read files, there are also de-interlacing tools to go back. So this is really just a format conversion that we need to do for some of the downstream tools. Okay, before we move on to this next part, I'm just going to check on my Galaxy analysis, so let me see how far my Galaxy is. We see it's still running, but it should be done soon, so just wait until yours is done before continuing. But we see that some steps are already done, so we can already start by looking at, for example, the FastQC output. We see FastQC has four items here: two that ran on the forward reads and two on the reverse, and the interesting one here for us is the web page output. So if we click here on the icon for one of the web pages, so this 
This is the forward-read report; let me drag the panel edges over to get a bit more room. At the top you see some general information: we have about 260,000 total sequences, each about 150 base pairs long, and the quality is already quite good for this data set. The per-sequence quality scores are also very high; anything between 30 and 40 is very good. There are several more plots here; take your time to look at them, and see the dedicated quality-control tutorial for what they all mean. For example, the per-base sequence content plot shows the percentage of A, C, G, and T at each position. The pattern here, with spikes at the beginning, is pretty typical for RNA-seq and is due to the random hexamer priming used in that protocol. Have a look at these plots, and feel free to ask any questions you have.

Now the nice thing is we can also show MultiQC: open its web-page output and you see not just one file but both forward and reverse reads in the same plots. They are the same plots we just saw in FastQC, but with some extra interactive features, which is really nice once you have more samples.

I will just wait until my workflow is done, and then we can continue. Okay, my workflow has finished, so we can check out the other outputs. We already looked at FastQC and MultiQC; the other tool we ran was Cutadapt. Cutadapt gives you FASTQ files with the cleaned reads, but it also produces a report; click the eye icon on the report output for more information. The summary tells you how many read pairs were processed and how many adapters were still present; we see that in our data set the adapters had already been trimmed off. It also shows how many reads were filtered out and why: for example, how many base pairs were processed and how many were removed because the quality was too low, and, after that trimming, how many reads became too short and were removed entirely. For that reason about 10 percent of our reads were removed: we started with 260,000 sequences, and after this quality-control step we are left with about 232,000 reads, roughly 90 percent. That looks fine, but it is always a good idea to check this report to make sure the step was not too strict; you would not want to discover you had thrown away 90 percent of your data and were left with too little information for the downstream analysis. So always make sure these reports make sense for your data.

So we did quality assessment, then filtering and trimming. The other tool we mentioned was SortMeRNA, because for the functional assignment we want to get rid of the ribosomal RNA sequences. SortMeRNA gives two FASTQ outputs: the aligned reads, which are ribosomal, and the unaligned reads, which did not match the rRNA database and are therefore non-ribosomal. It also produces a log file. The first part of that log is a bit cryptic, but if you scroll to the bottom, under the results header there is more information: it processed about 460,000 reads in total, which corresponds to our 232,000 pairs, and it shows how many passed the E-value threshold, meaning how many were deemed to be ribosomal RNA.
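The advice above, always checking that a filtering step was not too strict, boils down to simple percentages. A small hypothetical helper (not part of Galaxy or Cutadapt) captures the check:

```python
def check_filter_step(before, after, min_fraction=0.5):
    """Warn if a QC step discarded more than (1 - min_fraction) of the reads."""
    kept = after / before
    if kept < min_fraction:
        return f"WARNING: only {kept:.0%} of reads survived -- step may be too strict"
    return f"OK: {kept:.0%} of reads kept"

# numbers from the Cutadapt report in this tutorial
print(check_filter_step(260_000, 232_000))  # OK: 89% of reads kept
```

The 50 percent cutoff here is arbitrary; what counts as "too strict" depends on your data and downstream analysis.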
We see that approximately 25 percent of the reads were classified as ribosomal RNA; the remaining 75 percent of non-ribosomal sequences are what we will use for the functional part of this workflow.

The very last tool we ran was the FASTQ interlacer, and you see we end up with a single FASTQ file containing both forward and reverse reads. You can always recognize this by looking inside the file: a read name ending in /1 marks the forward half of a pair, and the next record has the exact same identifier ending in /2 to mark the reverse half. So the first four lines are the forward read of the first pair, the next four lines are its reverse read, then the forward read of the second pair, and so on. It is exactly the same information as in the two separate files, just put into one file, because some tools want it that way.

Now that we have nice clean data, we can get on with the exciting part: finding the taxonomy and the function of our sample. I am going to go back to the tutorial and do the same thing again: start the workflow and then explain what it does. As I said, there is more background in the tutorial itself, especially the extended version, about each of the individual tools. In this next part we do the taxonomic profiling: which organisms are there, and what is the community structure? After that, Subina will take you through the functional analysis. For the taxonomic assignment we will use MetaPhlAn, which compares our data set against a database of marker genes derived from over 17,000 reference genomes: bacterial, archaeal, viral, and eukaryotic. After that, we will visualize the results with the Krona and GraPhlAn tools.
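The interlacing itself is a simple alternation of paired records. A sketch, using short strings to stand in for full four-line FASTQ records:

```python
from itertools import chain

def interlace(forward, reverse):
    """Alternate paired records: f1, r1, f2, r2, ... (pairs matched by position)."""
    assert len(forward) == len(reverse), "forward/reverse files are unpaired"
    return list(chain.from_iterable(zip(forward, reverse)))

# each element stands in for a complete 4-line FASTQ record
fwd = ["@read1/1", "@read2/1"]
rev = ["@read1/2", "@read2/2"]
print(interlace(fwd, rev))  # ['@read1/1', '@read1/2', '@read2/1', '@read2/2']
```

Deinterlacing is just the inverse: take every other record into one file and the rest into another.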
But again, to save you a bit of clicking, we have combined these steps into a single workflow. So let's do the same thing again: copy the workflow link from the training material, go back to Galaxy, open the Workflow menu at the top, click Import, paste the URL, and hit the import button. Then we run it: scroll to the right and click the run icon on this community-profiling workflow. It should be pretty clear what it wants: it has two inputs, the QC-controlled forward reads and the QC-controlled reverse reads, which are already named properly in your history. Select those and hit Run Workflow.

While this is running, I will go back to the slides and explain what each step does. Our objective here is the community profile: we want to identify which organisms are present in our sample and what their relative abundances are. So, in the little depiction here, not only how many different species there are, but also how many of the red organisms we have compared to the green ones, for example. For this we use MetaPhlAn2 for identification, and then visualization tools to make the results nicer to explore and evaluate.

MetaPhlAn2 estimates the presence and relative abundance of microbial cells by mapping reads against a database of marker sequences. Keep in mind that this tool was originally designed for DNA-seq data. We can still use it for metatranscriptomics; we just have to be a bit more careful with the interpretation. With DNA-seq, the relative abundances tell us that one organism is more present in the sample than another; with transcriptomics data this can be skewed by differences in expression between the organisms. It still gives us valuable information, though.

The outputs from MetaPhlAn are not very friendly to human eyes, so we use visualization tools to make them nicer to explore. One very nice tool is Krona, which shows the community composition in an interactive plot: hovering changes the display, and double-clicking zooms in, for example to show only the bacteria. In this case we have a pretty simple sample with two main species: one accounts for about 95 percent of the sequences and the other for the remaining 5 percent. A second nice visualization tool is GraPhlAn, which also shows the composition, but with more emphasis on the relationships between the species we found, in a cladogram view.

All of this lets us see how the composition of a sample changes over time. In this tutorial we only analyze one time point from the data set, which corresponds to one bar in the time-series graph: most of what we find are the two species Clostridium and Coprothermobacter. But over multiple time points the composition changes, and since this is metatranscriptomics, the expression changes as well; you can follow the community profile over time, which is exactly what the original study did. Now I will go back to Galaxy, show you these outputs live, and then hand over to Subina for the functional analysis. My jobs are still running, so again I will wait until they are done.
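MetaPhlAn's actual estimation is more sophisticated (it works from marker-gene coverage), but the idea of a relative-abundance profile can be sketched with toy read assignments. The taxon names and counts below are invented to mirror the 95/5 split in this sample:

```python
from collections import Counter

def relative_abundance(assignments):
    """Percentage of reads assigned to each taxon."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {taxon: 100 * n / total for taxon, n in counts.items()}

# toy per-read taxon assignments
reads = ["Coprothermobacter"] * 95 + ["Clostridium"] * 5
print(relative_abundance(reads))  # {'Coprothermobacter': 95.0, 'Clostridium': 5.0}
```

Note the caveat from the slides: with RNA-seq reads, these percentages mix cell abundance with expression level.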
My workflow has finished now, so we can have a quick look at the outputs. You can see that MetaPhlAn made quite a few of them; look through these a little. For example, look at the predicted taxon relative abundances: the file is again a bit cryptic, but it lists everything that was detected, with the full taxonomy from kingdom all the way down to genus and species. You see that we mostly find bacteria and some archaea, and at the bottom the main species. MetaPhlAn also outputs a BAM file with the alignments, and a BIOM file; BIOM is a standard format that you can feed into other tools, for visualization for example.

The visualization we focus on today is Krona, and one of the MetaPhlAn outputs is already formatted specifically for Krona. If you scroll up a little, you see that we ran Krona, and there is an HTML output. It is the same as what was in the slides: a fairly simple community, a small number of archaea here, and the rest bacteria. Focusing on those bacteria, we see 95 percent coming from one organism and 5 percent from the other. You can always go back up to the root of the plot by clicking in the center, which shows you everything that was found again.

Let's look at the second visualization. GraPhlAn also makes a couple of different outputs: a tree, and the tree in a different format that you can again use to connect to other tools, but for now we will just look at the image output, the PNG. It is not quite as exciting as the one I showed in the slides, again because we have a fairly simple community here: it shows the one archaeal species and the two bacteria, plus the relationship between them; you can see they are quite unrelated. The size of the bubbles also reflects the relative abundance, so again 95 versus 5 percent. With a more complex community structure this plot becomes much richer. These are two very nice visualization tools for showing the community composition of your sample.

Okay, with that I think we are ready to focus on the other really fun part: the functional analysis of this community. Not only who is there, but also what are they doing? For that I will hand over to Subina, who will continue with the functional analysis. Thanks, and have fun!

Hello all, this is Subina Mehta. I will be covering the last part of the ASaiM workflow, which is extracting the functional information from the metatranscriptomics data. Now that we know who is present in the sample from the previous workflow, we would like to answer the question: what are these microorganisms doing, what functions do they perform in their environment? The three main outputs from this workflow are information about gene pathways; gene ontology terms, which are further classified into biological process, molecular function, and cellular component; and, last but not least, gene families.

Before getting into the details of the workflow, let's do the hands-on part. In your Galaxy instance, go to Shared Data at the top, click Workflows, and search for 'galaxyp' or one of the tags 'metatranscriptomics' or 'metagenomics'. Then click on the 'Metatranscriptomics WF3: Functional Information' workflow and import it using the plus sign. I will show two methods to import the workflow next. After running your second workflow, your Galaxy interface should look something like this, with the second workflow completed.
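MetaPhlAn reports taxa as pipe-separated clade strings such as `k__Bacteria|p__Firmicutes|...`. A minimal parser for that naming convention, as a sketch; real output lines also carry an abundance column and may include more ranks:

```python
RANK = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
        "f": "family", "g": "genus", "s": "species", "t": "strain"}

def parse_lineage(clade):
    """Split a MetaPhlAn-style clade string into named taxonomic ranks."""
    out = {}
    for part in clade.split("|"):
        prefix, name = part.split("__", 1)  # e.g. "g__Clostridium"
        out[RANK[prefix]] = name
    return out

print(parse_lineage("k__Bacteria|p__Firmicutes|g__Clostridium"))
```

This kind of parsing is also what lets downstream tools cut the table at a chosen taxonomic level.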
The last output in the history should be the GraPhlAn PNG. Now I will show you two methods for importing the functional-information workflow: from Shared Data, or from the Galaxy Training Network.

The first method is Shared Data: click Shared Data, select Workflows, and look for the 'Metatranscriptomics WF3: Functional Information' workflow; the owner is galaxyp. It is currently at the top of my list, but it does not have to be on yours, so search for 'metatranscriptomics', 'galaxyp', or the 'metagenomics' tag. When you search, two workflows will come up on your screen, and if you look carefully they are similar, but there is a small difference: the 'quick' version is the one we generally use for hands-on online trainings with time constraints, and it leaves out the HUMAnN tool because it is very time-consuming. For this workshop we are using the longer version, so please select 'Metatranscriptomics WF3: Functional Information' and make sure you are not using the quick version. You can see the tools present in the workflow by scrolling down. To import it, click the plus sign on the right; this imports the workflow into your own account, and if you click 'Start using this workflow' you will see it appear there.

The other way to import the workflow is via the Galaxy Training Materials: go to the metagenomics topic and select the metatranscriptomics (short) tutorial; even though it says 'short', the longer version of the tutorial is also present. You can scroll down, or click directly on 'Extract functional information', and you will find instructions for importing the workflow. As written there, copy the URL via right click (or download the workflow to your computer and upload it later). I will right-click it, copy the link address, then go back to Galaxy, select Import, paste it into the 'Archived Workflow URL' field, and click Import Workflow.

Now the workflow has been imported twice, and I can choose either copy; I will just use the first one to show you how to run it. You can click Edit if you want to inspect the tools involved in the workflow, and then run it with the play button on the right side of the screen, or you can just go to the Workflow menu and use the play button there. The tools in the workflow will then be queued up in the center pane. Please make sure you select the appropriate inputs. The first input is the interlaced non-rRNA reads, so click and select those. Next it asks for the predicted taxon relative abundance input; when you click and scroll down you will see two predicted taxon relative abundance outputs, but please select the plain 'Predicted taxon relative abundance' file, not the one formatted for Krona. Then, for the 'Cut predicted taxon relative abundance table' input, again select the appropriate data set. Once you are confident that the inputs you have provided are correct, go ahead and run the workflow. Invoking the workflow can take anywhere from a few seconds to a few minutes; once the invocation step is complete, all the jobs will show up on the right in the history pane. The jobs appear grey at first.
While running, the jobs will turn orange, and once completed they will turn green. Now that you have imported and successfully run the workflow, let's move on to understanding what you just ran.

In metatranscriptomics data we have access to the genes that are expressed by the community, and we can use that to identify genes and their functions, build pathways, and so on, to investigate their contribution to the community, using the HUMAnN tool. HUMAnN stands for the HMP (Human Microbiome Project) Unified Metabolic Analysis Network; it was developed by the Huttenhower Lab and is a pipeline in itself, wrapped as a single tool on the Galaxy platform. The two main inputs for this workflow are the interlaced non-rRNA reads and the predicted taxon (community profile) output from the MetaPhlAn tool. To identify the functions expressed by the community we do not need the rRNA sequences, especially because they add noise and would slow things down. The next few tools in the workflow are used for downstream processing of the HUMAnN data; I will go into detail on each of them later.

HUMAnN, as I mentioned, is a pipeline developed for efficiently and accurately profiling the presence or absence and abundance of microbial pathways in a community from metagenomic or metatranscriptomic sequencing data; it efficiently characterizes microbial metabolic pathways. This is the Galaxy wrapper of the tool. The two main inputs are the interlaced non-rRNA reads and the taxonomic profile from MetaPhlAn. The user has the option to add an additional ID mapping file for the alignments, but in this workflow we have not used that feature. The tool itself contains built-in databases: the ChocoPhlAn pangenome database, the UniRef50 and UniRef90 databases, and the MetaCyc and UniPathway databases for pathway analysis. The three main outputs of HUMAnN are the gene families and their abundance, the pathways and their coverage, and the pathways and their abundance.

Let's go into detail on how HUMAnN functions. HUMAnN performs a tiered meta-omic search. The main input is a quality-controlled metagenome, or in this case a metatranscriptome. The first step is an initial taxonomic screen using the MetaPhlAn profile you just provided: the reads are mapped to clade-specific marker genes to rapidly identify the community species. It then maps the reads to the pangenomes of the identified species, performing a nucleotide-level mapping against the ChocoPhlAn database; the reads are now either classified or unclassified. The unclassified reads are aligned to a comprehensive, non-redundant protein database, UniRef90 or UniRef50, through accelerated translated search. The results are then searched against the MetaCyc and UniPathway databases to give you the pathway information. From the mapping results, gene family and pathway abundances are computed by looking at the gene length, the alignment quality, and the gene coverage.

Here is an example of the gene family abundance and pathway abundance outputs; these files detail the abundance of each gene family and pathway in the community. Gene families are groups of evolutionarily related protein-coding sequences that often perform similar functions; here we use UniRef90 gene families, in which the grouped sequences have at least 90 percent sequence identity. Both gene family abundance and pathway abundance are reported in RPK (reads per kilobase) values, a unit that normalizes for gene length; it reflects the relative gene or transcript copy number in the community. You will also see entries called UNMAPPED or UNINTEGRATED in these two outputs: these are the reads that remain unmapped even after both alignment steps, the nucleotide as well as the translated search.

Now that we have the gene family output, we can see how complicated it is to interpret. So we use the regrouping tool, 'HUMAnN2 regroup', to group the gene families, that is, to convert the UniRef50/90 output into different categories by providing an ID mapping file.
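The RPK unit can be illustrated directly. This is just the definition, not HUMAnN's full computation, which also weighs alignment quality and gene coverage:

```python
def rpk(read_count, gene_length_bp):
    """Reads per kilobase: read count normalized by gene length,
    so long and short genes become comparable."""
    return read_count / (gene_length_bp / 1000)

# 500 reads on a 2 kb gene vs. 500 reads on a 500 bp gene
print(rpk(500, 2000), rpk(500, 500))  # 250.0 1000.0
```

The same raw read count gives a four-times-higher RPK on the four-times-shorter gene, which is exactly the length bias the unit removes.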
The gene families can be converted into MetaCyc reactions, KEGG orthogroups (KOs), Pfam domains, Enzyme Commission (EC) categories, Gene Ontology terms, informative GO terms, or GO slim terms. In this workflow we use the Gene Ontology terms. GO term analysis is widely used to reduce complexity and highlight biological processes in genome-wide expression studies, and this dedicated tool groups and converts the UniRef50/90 gene family abundances generated by HUMAnN into GO terms.

The regrouping tool leaves us with bare GO identifiers, so we then use the rename tool to attach human-readable names and to classify the GO terms into the molecular function, biological process, and cellular component categories. There is another tool that can split the HUMAnN output into a stratified or unstratified table: the stratified table provides, for each GO term, the function plus the genus and species involved along with the abundance value, whereas the unstratified table gives you only the GO terms and their abundance in the sample.

Here is an example of how the UniRef values are converted into GO terms using the rename tool: you can see the GO terms, the genus and species, and the abundance in RPK values. When you convert the gene family output from HUMAnN into GO terms, the output looks something like this: it has the GO ID, the function (molecular function in this case) with its description, the genus and species identified, and the abundance value. What we just saw was only molecular function; as I mentioned, there are three different outputs, molecular function, biological process, and cellular component, and in every case you will see the GO ID with the function, the description of that function, the genus and species, and the abundance value.

In this workflow we have also incorporated the 'Combine MetaPhlAn and HUMAnN outputs' tool. But before using the combine tool, the first step is to normalize the abundances, and for that we use the renormalize tool. Gene family and pathway abundances are in RPK (reads per kilobase) values, which account for gene length but not for sample sequencing depth. There are some applications, for example strain profiling, where RPK units are superior to depth-normalized units, but most of the time we do need to renormalize our samples prior to downstream analysis, and for that we use the renormalize tool; once renormalized, the outputs look something like this. After normalization we use the 'Combine MetaPhlAn and HUMAnN outputs' tool to combine the gene families or pathways from the MetaPhlAn and HUMAnN outputs; as the name suggests, it gives you the genus, the species, their abundance value, and the gene family ID in one table.

Now, if you would like to know which gene families are involved in our most abundant pathways, along with the species, we use the 'Unpack pathway abundances to show genes included' tool, which also comes from the HUMAnN tool suite. This tool takes the normalized pathway and gene family outputs, renormalizes the gene and pathway abundances into copies per million or relative abundance values, and adds another level of stratification to the pathway abundance by including the gene families. This is what the Galaxy wrapper looks like, and this is the output it provides: the most abundant pathways, and for each pathway the genes, genus, and species involved, plus the UniRef IDs and their abundance values.

That concludes the tools present in this last workflow. All the outputs from this workflow are tabular, so as a user you can process them further according to your liking.
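The renormalization step described above, rescaling RPK values to copies per million so samples with different sequencing depths become comparable, can be sketched as follows. The GO identifiers are real terms, but the numbers are invented:

```python
def renormalize_cpm(rpk_values):
    """Rescale per-feature RPK values so the sample sums to one million
    (copies per million), removing the effect of sequencing depth."""
    total = sum(rpk_values.values())
    return {feat: 1_000_000 * v / total for feat, v in rpk_values.items()}

sample = {"GO:0004553": 250.0, "GO:0016757": 750.0}
print(renormalize_cpm(sample))  # {'GO:0004553': 250000.0, 'GO:0016757': 750000.0}
```

Relative abundance is the same rescaling with a target sum of 1 (or 100) instead of one million.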
In conclusion, after running the ASaiM workflow you obtain all of the tabular outputs mentioned here. For taxonomy, you get information at the kingdom, phylum, class, order, family, genus, species, and strain levels, and you can specify which taxonomic level you want the output to have. In the function category you get the pathways, the gene ontology terms (further classified into biological process, molecular function, and cellular component), and the gene families; as mentioned before, you can reclassify these into other categories using other ID mapping files. Please take a look at the outputs you just obtained once all of the items in your history have turned green. Here is the paper we just published in F1000Research on the work we presented. This concludes the hands-on portion of the training; thank you for listening thus far.

Now that we have learned about pre-processing, taxonomic composition, and functional analysis using the ASaiM-MT workflow, it is important to note that the outputs it generates, the abundance values at the genus, species, or strain level, and the pathway, gene ontology, and gene family abundances at the functional level, are for a single time point or a single replicate. If you are interested in comparing multiple replicates or various conditions, you need to take these outputs from the various time points or replicates and aggregate them into an input for a tool called metaQuantome. For example, the abundance values associated with function and taxonomy in your metatranscriptomics data can be put into a tool called mt2mq, short for 'metatranscriptomics to metaQuantome', which generates the tabular input that metaQuantome expects. The metaQuantome tool was originally developed for metaproteomics analysis, but it is a suite of tools that can also be used for quantitation in metatranscriptomics: it helps you perform statistical analysis and generate visual outputs for data exploration, differential expression, and heat-map cluster analysis.

To understand metaQuantome a little better, let's look at how it functions. As I mentioned, it was developed for metaproteomics, where mass-spectrometry-based identifications of peptides, their intensity levels, and functional and taxonomic information are fed into metaQuantome, which performs statistical analysis to give you an idea of the functional and taxonomic state of a microbiome. The tool was developed by Caleb Easterly and was published in 2019 in Molecular & Cellular Proteomics.

To demonstrate the use of metaQuantome for metatranscriptomics data, we will use the same data set Saskia described earlier in this workshop. This data set comes from Magnus Arntzen's lab: food waste and manure were used to generate a microbiome that was serially diluted and eventually used for the degradation of cellulose, and this cellulose degradation by a minimal microbiome was monitored at time points from zero up to 43 hours. For this study we used the 23-, 33-, and 38-hour time points, also called T4, T6, and T7, and the data were analyzed by Praveen Kumar, Subina Mehta, and Marie Crane using the metatranscriptomics workflow, mt2mq, and the metaQuantome software.

To summarize the analysis: quite a few gene families and pathways were detected, as well as gene ontology terms at the molecular function level. This was a simple microbiome in the sense that only four prominent genera were detected across the three time points; looking at the genus abundances, Coprothermobacter, Hungateiclostridium, and Methanothermobacter were among the main organisms present in this data set.
As you can see, initially, at time point T4, Hungateiclostridium was the dominant genus, and its abundance decreased as time progressed, while the abundance of Coprothermobacter increased over time. We also performed principal component analysis on the metatranscriptomics data, and with either function or taxonomy you can see that the T4 a/b/c replicates cluster together compared with the rest of the time points. With the metaproteomics data, which we also have available, metaQuantome analysis actually showed an even better separation of the early time points from the rest. We also used metaQuantome for heat-map analysis, which helped us differentiate the three time points.

One of the interesting features of metaQuantome is its volcano plot analysis, which helps you detect differentially expressed genes. In this case, what we show are genes involved in cellulose degradation, and these genes turn out to be overexpressed when comparing T6 over T4, or T7 over T4. Another interesting feature of metaQuantome is that it can answer questions such as: what are the functions expressed by a particular taxon? For example, looking at Hungateiclostridium, some of the genes it expresses for cellulose degradation are down-regulated over time. You can also ask the reverse question: for a particular function, such as glycoside hydrolase or glycosyltransferase activity, what is the contribution of the various genera? As you can see here, different genera contribute differently to this particular function.
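The axes of a volcano plot are just a fold change and a p-value transform. A sketch with invented numbers (metaQuantome computes the underlying statistics itself):

```python
import math

def volcano_point(mean_a, mean_b, p_value):
    """Coordinates of one gene on a volcano plot:
    x = log2 fold change (B over A), y = -log10 p-value."""
    return math.log2(mean_b / mean_a), -math.log10(p_value)

# hypothetical cellulase abundances at T4 vs. T6
x, y = volcano_point(10.0, 40.0, 0.001)
print(round(x, 2), round(y, 2))  # 2.0 3.0
```

Points far to the right (or left) and high up are the strongly and significantly differentially expressed genes.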
Now, to learn more about these new metaQuantome updates and its use for time-course and metatranscriptomics analysis, I strongly recommend the manuscript we published this year in the Journal of Proteome Research, which highlights both of these features and also offers a step-by-step guide on how to use metaQuantome.

Lastly, I think it is very important to understand the context of why we perform metatranscriptomics analysis. For a given data set we could have metagenomic, metaproteomic, and metatranscriptomic analysis, and each of these three methods has its own strengths and its own questions it can answer; sometimes you actually get a better answer by taking a systems biology, or holistic, approach in which you integrate all three. For this I strongly recommend a blog post in which the authors implemented their metatranscriptomics, metaproteomics, and metagenomics tools within the Galaxy framework to answer some really interesting questions; because they had set up the experimental design for the study themselves, they had a very clear understanding of the questions they were addressing. I strongly recommend reading about this, because it may also give you some alternative tools for your meta-omic analysis.

If you are interested in finding out about functional microbiome Galaxy workflows, not only for metatranscriptomics but also for metaproteomics, I strongly recommend the public Galaxy Europe instance; some of the tools and workflows are also available in the Galaxy Training Network as part of the metaproteomics, metatranscriptomics, and metaQuantome tutorials. If you have any project-, tool-, or workflow-specific questions, please reach out to us at the contact email address on our website, and we will be happy to answer them.

Lastly, I would like to mention that this work was done not only by researchers at the University of Minnesota and the Minnesota Supercomputing Institute; we have also collaborated with researchers across the world: users, experts in the field of metatranscriptomics and microbiome research, software developers, and training experts who have helped make this tutorial and the dissemination of these resources possible. Thank you very much for attending this workshop.