Welcome everybody. My name is Saskia Hiltemann, and today I will guide you through the 16S rRNA sequencing with mothur tutorial on Galaxy. This tutorial is in the metagenomics section of the GTN. As you'll see, there are two versions of this tutorial. Both are exactly the same, except that the extended tutorial goes through every step manually, running all the tools one by one, while the short version consists of a series of workflows that run the exact same tools; it is just a little faster to click through, so you can focus more on the underlying concepts. Today we will do the short tutorial, but at the start of every section you can switch to the extended version if you would like more information about that section. Beneath every section you see the option to switch to the extended tutorial if you want to go through every tool step by step, or to switch back to the short one if you just want to run the workflow and examine the outputs from there.

In this tutorial we will follow the standard operating procedure (SOP) for MiSeq data as it was developed by the creators of the mothur tool package, the Schloss lab. The mothur package provides the analysis tools we are going to use. Here is what we will cover: first we will get our data into Galaxy, and because we have multiple samples we will create a collection, to make things easier to handle. That way we don't have to run every tool on every input dataset separately; we only have to run each tool or workflow once for the entire set of samples.
Then the first step, as always, is quality control: we want to make sure we have high-quality data, so we will filter and trim our reads and clean them in other ways. Then we will do some sequence alignment, and we will assign taxonomies to our sequences. There is an optional step here to calculate error rates based on a mock community. To assess how well this whole process works, both the sequencing and the in silico analysis, you can sequence a mock sample: an artificial sample with known quantities of known microorganisms in it. Ideally you would find back exactly what you put in; if you find other things, you know there might be contamination somewhere in your process, and if you find them at the wrong rates, you know there might be some bias. So you can use this to assess the error rate of your whole process. We will also do OTU clustering and some diversity analysis, and finally some visualizations.

There is some more information here about the 16S ribosomal RNA gene, which I also covered in the slides. We use this gene because it is highly conserved across bacteria, so we can easily target it, and we can use the variable regions this gene contains to differentiate between the genera in our sample.

The data we will use for this tutorial was also generated by the Schloss lab, the creators of the mothur tool suite. They ran a very big experiment: they followed a large set of mice, sampling their feces and sequencing them every day for a year. For the first 150 days they just let these mice grow naturally, because they wanted to assess the natural variation in microbiomes between the mice. That is 150 time points, which is far too much data for this tutorial, so here we just take the first 10 time points and the last 10 time points of that 150-day period for one of the mice, and compare those two groups.

Now, a quick note about the dataset naming scheme. We will be using a bunch of samples named like F3D0, which in this case means female mouse number 3 on day 0. Since we are doing paired-end sequencing, every sample consists of two files, two FASTQ files, identical in name except for R1 for the forward reads and R2 for the reverse reads.

Okay, with that, let's start importing our data into Galaxy. Before we start our analysis, let's make sure we have a new history: if your history is not empty, please click the plus icon at the top to start a new history, and give it a name you can remember, something like "16S rRNA analysis". Like I said, we are going to take 20 samples, 10 early time points and 10 late time points. These are paired-end sequences, so there are two files per sample, and in total we have 40 files.

There are two ways to get your data. There are links in the tutorial to Zenodo, so you could copy all of those URLs and upload the files via the upload button, as you've done before. But on the big Galaxy servers, many of these tutorials also have their data in a shared data library, so we'll demonstrate that method; if it is not available on the Galaxy server you're on, please upload the files by their URLs instead. To get to the shared data libraries, at the top of Galaxy we go to Shared Data, then Data Libraries. This will look a little different depending on which server you're on, but you should see the GTN - Material data library. If we click on that, we see it is organized the same way as the tutorials themselves: you have the different topics, so we scroll down to Metagenomics, click on that, and then at the top, for me, there is 16S Microbial Analysis with mothur. Click on that and you see there are different versions, because we've updated the tutorial a few times, but let's just go with the highest number. Here we are simply going to import everything, and you can see a bunch of FASTQ files, as expected.

Let's select everything, and then at the top there is the Export to History button, with two options: as datasets or as a collection. Let's export them as datasets first; we will make the collection later. If you click that, it asks which history to import into, so just pick our current one. The nice thing about using data from a shared data library is that it doesn't count against your quota, so it's sort of free data; if it's there, I would suggest using it. It's also faster than uploading from URLs, and it saves space on the server, since not everybody is duplicating the same datasets. Once you see the green box indicating everything went all right, go back to the home page by clicking on Analyze Data.

Now we see 46 files in our history: 40 of them are our FASTQ samples, and there are also some extra datasets, reference datasets we will use later in the tutorial. The first thing we want to do is organize all our sample FASTQ files into a collection, so that we can run everything once on the entire collection and save ourselves a lot of clicking. To do that, let's first filter our history to show only the FASTQ files, not the other reference data. At the top of the history panel on the right there is a search bar you can use to filter your history; if we type "fastq" there, we find exactly 40 files, as expected. These are all our sample data, the ones we want to put into a single collection. So we click the checkmark icon at the top of the history (hover over it and it says "Operations on multiple datasets"), select all, and then under "For all selected" there are different options, but we want to build a list of dataset pairs, because we have paired-end data with multiple paired-end samples.

Galaxy will try to create these pairs for you, automatically detecting which forward and reverse reads go together. You see it is looking for "_1" and "_2", but remember that for us it is "_R1" and "_R2". We can either edit these two fields manually, or use Galaxy's default filters: click "Choose filters" and tell it that our files are distinguished by the "_R1" and "_R2" notation. Now Galaxy finds the pairs. It's a little hard to see with these long file names, but just hit auto-pair, and you can see it takes the two files of each sample, pairs them together, and calls the pair F3D0, for example. That all looks okay, and it made 20 pairs this way, so everything got paired. Then all that's left to do is scroll down to the bottom and enter a name; let's just say "samples". Again, you can name this whatever you want; the most important thing is that you know what is in your collection. Then hit Create List.

Now we're back at our history, so click the checkmark icon at the top again to exit selection mode, and clear the history filter with the x button, because we were still showing only FASTQ files. Now we see our collection of mouse samples. It looks like one dataset in the Galaxy history, but if you click on it you see that it actually contains 20 pairs, and each pair in turn contains two files: one forward FASTQ file and one reverse FASTQ file. Now we can take this entire collection, run it through a tool, and that tool will run 20 or 40 times, depending on what the tool does, with only one click. That is how these collections save us a lot of clicking.

Okay, great. With that we have all our data ready and organized in Galaxy, and we can start our analysis with some quality control. The first step of our quality control process will be to create contigs from our paired-end reads. In our case we have sequenced the V4 region of the 16S rRNA gene. We know this region is approximately 253 base pairs long, and our reads are about 250 base pairs long. With paired-end sequencing, one read was sequenced from one side of this 253-base-pair region in one direction, and the other read from the other side in the other direction, so as you can imagine there will be a lot of overlap in the middle between the forward and reverse reads.

We are going to find that overlap and use it to improve the quality of our sequence. Here you can see an example of how this works. Imagine you have this forward read and this reverse read; you take the two and find the best overlap. You see that if you overlap them like this, with two bases overhanging at each end, you have a pretty good match in most places; there is only one mismatch, and that could easily be a sequencing error. Where the reads match, we say the consensus base is, for example, an A, and the same for the C, because both reads agree. We can also improve the quality score: say this A had a quality score of 10 in one read and 20 in the other; now that we know both reads agree on that position, we can be more confident that it really is an A, so we can increase the quality score. If there is a mismatch, we take the base with the higher quality of the two: here one read thinks this position is an A and the other thinks it is a T, and the T has the higher confidence score, so our consensus sequence takes the T, but we also lower our confidence a little because of the mismatch in the other read. And of course the few bases at the ends that don't overlap are copied straight into the consensus sequence. We call this consensus sequence a contig.

In mothur there is a tool called make.contigs to do this, so let's go back to Galaxy, search the tool panel for make.contigs, and open that tool. First it asks how we want to provide our forward and reverse FASTQ files. If you had only one sample you could just give two separate files, but we have multiple samples, so we say we have multiple pairs and that they are in a collection. We only have one collection in our history, so it auto-detects that one. The rest I'll leave at the default parameters and just hit Execute at the bottom. For each sample this will take the forward and reverse reads, find the optimal overlap, and generate a consensus sequence, a contig, with somewhat improved quality. The tool also takes all these consensus sequences and puts them into one big file. While we wait for the results, we can already look in the tutorial at what the output will look like: one big FASTA file with all the sequences, plus, in order to remember which sample each sequence came from, something called a group file. This is a very simple file format with the read name in the first column and the sample it came from in the second column. These reads came from female mouse 3, day 0, and further down there will be some from a different day. This all keeps things a little simpler in mothur: we will have one big FASTA file with all our sequences and one group file, and we just need to provide both of these files together to the subsequent tools to keep them in sync.
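To make the merging logic concrete, here is a minimal sketch in Python of the consensus rule just described. This is only an illustration of the idea, not mothur's actual implementation; the function name and the quality cap of 40 are my own assumptions.

```python
# Toy consensus builder for two already-aligned, equal-length read segments.
# Agreement: keep the base and boost confidence (sum of scores, capped at 40).
# Mismatch: keep the higher-quality base, but at reduced confidence.

def consensus(fwd, fwd_qual, rev, rev_qual):
    bases, quals = [], []
    for b1, q1, b2, q2 in zip(fwd, fwd_qual, rev, rev_qual):
        if b1 == b2:                       # both reads agree on this base
            bases.append(b1)
            quals.append(min(q1 + q2, 40))
        else:                              # mismatch: trust the better call
            bases.append(b1 if q1 >= q2 else b2)
            quals.append(abs(q1 - q2))     # confidence drops after a conflict
    return "".join(bases), quals

seq, qual = consensus("ACGT", [30, 10, 20, 25], "ACTT", [30, 15, 35, 25])
# position 2 disagrees (G vs T); T wins because quality 35 > 20
```

The non-overlapping ends would simply be appended to this merged middle part, as described above.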
Okay, so while we wait for make.contigs to finish, we can have a sneak peek at the next steps. After we have these contigs we will do some data cleaning. First we will filter by length: we know that our target region, the V4 region, is about 253 base pairs long, so if a contig is longer than 275 base pairs, there probably wasn't a good overlap between the two reads, which means something went wrong with the sequencing, and we remove it. We are also going to remove low-quality contigs: if the consensus quality is below a certain threshold, that probably means there were many mismatches, or the confidence scores were low to begin with, and we take those out as well. And lastly, because we are sequencing a microbiome, and a highly targeted region of the genome at that, we are going to deduplicate the sequences. We will likely have many identical sequences, and for the downstream analysis it is a waste of computation to map the exact same sequence to the exact same reference over and over. It is better to take every unique sequence once, and just remember in a separate file how many times that unique sequence was encountered. We will do that with the unique.seqs tool.

We can see that make.contigs has finished in the meantime, so let me give you a quick look at that. You can use the eye icon to look at the different files; you see it has made a bunch of them. The first one is trim.contigs.fasta: these are the trimmed sequences. This is the FASTA format, so we have just the read name and the sequence, no longer the quality information like in a FASTQ file. mothur stores the qualities in a separate file, which you see here as trim.contigs.qual; they like to keep that separate. There you again see the read names, each followed by its list of quality scores.
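The deduplication idea behind unique.seqs can be sketched like this. This is a toy illustration, not the real tool, and the function and read names are made up.

```python
# Keep each distinct sequence once; the first read carrying it becomes the
# representative, and we remember how many reads it stands for.

def deduplicate(reads):
    """reads: iterable of (read_name, sequence) tuples."""
    uniques = {}
    for name, seq in reads:
        if seq not in uniques:
            uniques[seq] = {"representative": name, "count": 0}
        uniques[seq]["count"] += 1
    return uniques

reads = [("read1", "ACGT"), ("read2", "ACGT"), ("read3", "TTGA")]
uniques = deduplicate(reads)
# 2 unique sequences; "ACGT" is represented by read1 and counts 2 reads
```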
Those quality scores are the Phred scores. The other file mothur makes is scrap.contigs: any pair for which no overlap could be found will be in here. For us it has zero lines, which means our data was pretty okay, but normally you would see here anything that was not able to form a contig. There is also a short report showing exactly where the overlaps are and how much overhang there is at each end, in case that is useful. And here is the group file I mentioned. This one is important: it tells us, for every read name you saw in the FASTA file and the quality file, which sample it came from. We have a lot of F3D0, and if we scroll down long enough we find read names from other samples. The FASTA file and the group file are what we will give to the other tools to do their analysis on.

To do the data cleaning step, go back to the tutorial. Like I said, in this short version of the tutorial we are going to run these three steps as a workflow, just to save a little clicking. The workflow is linked in the hands-on box: you can copy the URL (for example, right-click it and copy the link location), or you can click on it to download it to your computer and then re-upload it to Galaxy. I'm going to use the URL, so I have it copied. The description of what to do next is also in this box, but I will show you: go to Galaxy and go to Workflows at the top. You see I have a lot of workflows already; you might not have any yet, and that's fine. There are two buttons at the top: if you want to create a workflow from scratch you can use the Create button, but we are going to import one, so click the Import button and give it the URL you copied. Galaxy workflow files always end in .ga (or .yaml nowadays), so it should look something like this. Then just hit Import Workflow, and you should see it at the top of your workflow list: "Workflow 1: Quality Control", for the Galaxy Training Network 16S analysis with mothur. Cool.

If you click on the name of a workflow you get some more information: you can view all the steps, rename it to something you remember a little better, or make your own copy so that you can make changes if you want. If you hit Edit you see exactly what this workflow does, step by step. You see it takes the contigs FASTA file and the group file and runs a number of tools: screen.seqs does the screening, the filtering and trimming; unique.seqs does the deduplication we talked about; and summary.seqs gives us some summary information about our dataset. It doesn't change anything, it just reports on the data. Let's go back. You can either hit the play button at the top right here to run your workflow, or, in the workflow menu, instead of clicking on the name, there is also a play button on the right to run the workflow. That's what we're going to do. When you run a workflow you can always choose whether to send the results to a new history or not; I'm going to keep them in my current history, but this is up to you. You see that this workflow requires two input files. It wants the contigs, which will be the output of make.contigs; make sure you select trim.contigs.fasta, not the scrap file, because scraps are what we throw away and trim is what we keep. For the group file, make sure it is the group file that was output by make.contigs. You can scroll down to see some of the other settings and the tools that will be run, but we'll leave all of that at the default values and just run the workflow. You can see a little bit about the workflow invocation: there are seven steps that were scheduled, and none of them are complete yet, so I just have to wait a bit.

If jobs stay gray for a while, that usually means the server is a little busy; that's fine. If a lot of people are using Galaxy at the same time, you might have to wait longer than at other times, so come back later, or carry on, because in Galaxy you can always schedule the next step even if the previous one is not finished yet. Okay, I see my jobs are getting scheduled now, so we'll give that a minute to run, and in the meantime I'll show you a little of what the output files and file formats will look like.

Like I said, the screening step will do some filtering and cleaning and remove some low-quality reads. One of the questions I would like you to try to answer when your data is ready is: how many sequences were removed in the screening step, and how many unique sequences do we end up with in our clean dataset? Because of the deduplication with unique.seqs, we now have a third thing to keep track of. We already have the FASTA file and a separate file to remember which sample each read came from, and now we also have to remember, for each unique sequence, how many total sequences it represents. We can capture both of those things in a count table, which is output by this deduplication step. The count table looks something like this: for every unique sequence there is a line, named after its representative sequence, and this will be a big table. You can see that, in total across all samples, the sequence represented by this read was observed 4,402 times: 370 times in sample F3D0, 29 times in sample F3D1, and so on. So the count table holds both pieces of information: which samples the reads came from and how many times each unique sequence was observed. From here on out we will have our FASTA file and our count table.
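The structure of such a count table can be illustrated with a small sketch. This is a hypothetical helper, not mothur code, and the sample and read names are invented.

```python
# One row per representative sequence: name, total count, then one count
# column per sample, reconstructed from the group-file mapping.

from collections import defaultdict

def build_count_table(seq_to_reads, read_to_sample, samples):
    """seq_to_reads maps a representative name to the reads it stands for."""
    rows = []
    for rep, read_names in seq_to_reads.items():
        per_sample = defaultdict(int)
        for r in read_names:
            per_sample[read_to_sample[r]] += 1
        rows.append([rep, len(read_names)] + [per_sample[s] for s in samples])
    return rows

table = build_count_table(
    {"seqA": ["r1", "r2", "r3"], "seqB": ["r4"]},
    {"r1": "F3D0", "r2": "F3D0", "r3": "F3D1", "r4": "F3D1"},
    ["F3D0", "F3D1"],
)
# table == [["seqA", 3, 2, 1], ["seqB", 1, 0, 1]]
```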
These two files are what we provide to the downstream tools so they have all the information. I will show you the output of summary.seqs soon; it gives information about the current dataset, showing both how many unique sequences we have and how many total sequences they represent. Let me check; it's still running, but we can already see the first summary.seqs, from before the cleaning. We run a summary before the cleaning step and a summary after, so we can see how well the cleaning worked. Let's have a look at this output now. You see it starts with general information about the tool and the version, but if you scroll down a little, this table is what we want to see. It gives us information about the length of all the reads. Most of them are about as long as we were expecting: we know our region is approximately 253 base pairs, and we see that 97.5 percent of reads are 253 base pairs or shorter. The longest one we have is 502 base pairs; this is very likely a pair that had almost no overlap, only a few bases, which is why the contig is so long. We don't really trust reads like this, because there should be a lot of overlap, so something probably went wrong in the sequencing; in our data cleaning we want to remove reads like this. We also see that fewer than 2.5 percent of reads are this long. On the lower end, the shortest read is 248 base pairs, and that is quite plausible: it could easily be a real deletion in the sequence. The table also has a column called "ambigs", for ambiguous bases: the number of base calls of very low quality that the sequencer was not sure about. The highest count is 249; again, this is the read that did not overlap well, so we are unsure about all its bases. At the 97.5th percentile there are 6 ambiguous bases, so we probably want to filter out the reads with these higher counts.
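To see roughly how such a table comes about, here is a nearest-rank percentile summary over contig lengths. This is an illustration only; summary.seqs computes its statistics internally and may use a slightly different percentile convention.

```python
# Summarize a list of contig lengths the way the summary table presents them:
# min, max, and a few percentiles (nearest-rank convention, an assumption).

def length_percentiles(lengths, pcts=(2.5, 25, 50, 75, 97.5)):
    ordered = sorted(lengths)
    n = len(ordered)
    out = {"min": ordered[0], "max": ordered[-1]}
    for p in pcts:
        rank = max(1, round(p / 100 * n))   # nearest-rank, 1-based
        out[f"{p}%"] = ordered[rank - 1]
    return out

stats = length_percentiles([248, 250, 253, 253, 253, 253, 253, 253, 254, 502])
# the single 502 bp contig only shows up at the max; the bulk sits at 253
```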
We can also see the total number of sequences: we have 152,360 in total, and you can see the values at each percentile. The last column is "polymer", for homopolymer stretches: the longest run of the same base observed in a row. Sequencers are known to be bad at homopolymer stretches; it is hard for them to distinguish whether a run was eight A's in a row, or seven, or nine. We also know that our region shouldn't really have more than five or six of the same base in a row, so if a read has significantly more than that, we should filter it out too.

screen.seqs is the tool that does this filtering. We see it is still running, but we have given it an ambiguous-base threshold of zero, so any read with any ambiguous bases is thrown out. We know that we won't throw out more than about 2.5 percent this way, so we can be quite strict, since our data looks quite good; you might want to set that to one to allow some ambiguous bases, but we are going to be strict. And we set the maximum homopolymer stretch to eight, so the worst reads get filtered out that way.

Perfect timing: screen.seqs has just finished. As output we again get a FASTA file, and we see it has been filtered down to 128,872 sequences, from 152,360, so the difference is the number of reads we removed. We can find out which reads those were in the bad.accnos file. If we look inside this file, we can see which reads were thrown out and why: this read was thrown out because it had too many ambiguous bases (more than zero), and this read was thrown out for two reasons, because it had 20 ambiguous bases and because it was too long. This is always good to look at: you want to make sure you are not being too strict and throwing away too much data, so always check the number of reads that get thrown away.
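The screening criteria just discussed can be sketched as a small filter. This is toy code using the thresholds mentioned above; screen.seqs itself has more options and operates on whole files.

```python
import re

def max_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    return max(len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))

def passes_screen(seq, max_ambig=0, max_homop=8, max_length=275):
    """Apply the three screens: ambiguous bases, homopolymers, length."""
    ambig = sum(1 for b in seq if b not in "ACGT")
    return (ambig <= max_ambig
            and max_homopolymer(seq) <= max_homop
            and len(seq) <= max_length)

passes_screen("ACGTACGT")          # True: clean read
passes_screen("ACGTNACGT")         # False: one ambiguous base (N)
passes_screen("A" * 9 + "CGT")     # False: homopolymer run of 9
```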
After that cleaning we do another summary.seqs, so we wait a little for that to finish. You see the workflow has also already run unique.seqs: it has taken all our cleaned FASTA reads and looked at how many duplicates there are. Please see if you can find how many unique sequences there are and how many duplicates. It is pretty easy to answer in this case: click on the unique.seqs output and you see the number of sequences, 16,426, and we started with about 128,000, so the vast majority are duplicate sequences. This is expected, since a lot of the same organisms are present in the community we are sequencing, and they will have the same V4 region, so this is not unexpected for this type of data. Let's quickly look at the summary after cleaning: click the eye icon and scroll down again, and we see it looks better now. Our maximum length is no longer 502 but 270, within the limit we set; all our reads have zero ambiguous bases; and the total number of sequences is now 128,872. The final step of the workflow was to create the count table using the count.seqs tool, and here is all the information: this representative read stands for 4,402 total sequences, and you can see the number of times this exact sequence occurs in each of the samples. This count table is what we will provide, together with the FASTA file, to the downstream analysis.

The next step is to take these cleaned sequences and align them to our reference. You should have seen more general information about sequence alignment and mapping in earlier tutorials; there is a link to the training materials here if you have not. This step isn't always done: sometimes the clustering into OTUs is done without an alignment step first, but the creators of mothur have shown that the clustering improves if you align to a reference first, so that is what we will do. There is a tool in mothur called align.seqs to do this, so again search the tool panel for align.seqs. Here we have to provide our set of unique sequences, the FASTA output from unique.seqs, and we have to give it our reference database; we use the SILVA reference database. We have to select here that we have this file in our history: the cached references are preconfigured by the administrator of your Galaxy server, and if you want to provide your own, you switch this to your history. We scroll down to one of the files we imported at the beginning, the one ending in silva.v4.fasta, and take that as our reference file. We leave everything else at the defaults and hit Execute.

What this will do is take each read and try to align it to the best place in the SILVA reference. Again we can use this for some cleaning, because if a read does not map to the V4 region as expected, that read is not very useful to us, and we will throw it away. After alignment we can look at where everything aligned to see how well it went. This might take a few minutes, so if you want to get a cup of tea or coffee, now is a good time; just pause the video and come back later.

Okay, that was pretty quick actually; mine has finished, so let's have a quick look at the output files. We have two outputs here. First the aligned sequences; I'll show you quickly what that looks like. We again have all the read names, and below each one what looks like just a bunch of dots, but if you scroll to the right and are a bit patient, once you get to the V4 region of the reference you will see the reads start to show alignments. We will use a tool in a later step to cut off all these unnecessary dots and make things a little easier to look at, but here you see we have come to the region where the reads have started to align: here are the first bases at this position, then there is a gap, then the A and the C. Of course this is a huge file, with 16,000 or so reads in it, so it is not very nice to look at directly. The align report contains a bit more information: for each read you can see how long it was, where it aligned, and what the score was. The SearchScore is a percentage, so 100 means it mapped perfectly, and you can also see more information about the settings that were used and where each read mapped.

Actually, we can use summary.seqs again to get a nicer summary report, so let's do that next. Search again for summary.seqs; this time we give it the align output from align.seqs, and we also give it the count table, so that it can tell us not only the number of unique sequences but also the total number of sequences they represent; providing the count file gives a more complete summary of your data. We also say we want to output the log file, because this is where the table will actually be, and we hit Execute. That shouldn't take too long once it is my turn on the server, but I can already show you in the training materials what the output will look like. You probably recognize this by now: it has the percentiles again, but this time it also has the position in the alignment where each read lined up. We know that in our reference the V4 region goes from about position 1900 to 11,500, so anything that aligns there is expected, and anything that aligns very far away is probably an error. For example, the earliest alignment here starts at position 1250, so this one is probably wrong and we are going to throw it away; similarly, the latest start position, 1982, is probably also wrong. But we see that the bulk of our data aligns exactly where we expected.
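A toy version of that positional screen can look like this. It follows the screen.seqs convention that a kept read must start at or before the expected window start and end at or after the window end; the coordinates here are the approximate ones quoted above and should really be taken from your own summary.

```python
# Expected V4 window in the reference alignment (approximate, assumed values).
V4_START, V4_END = 1900, 11500

def spans_v4(start, end):
    """start/end: alignment coordinates of a read on the reference."""
    return start <= V4_START and end >= V4_END

spans_v4(1850, 11600)   # True: the read covers the whole expected window
spans_v4(1982, 11600)   # False: it starts too far into the region
spans_v4(1850, 9000)    # False: it ends before the window does
```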
So we can just throw out the outliers. And since we provided both our FASTA file and the count table, we can see that all of this represents 16,000-odd unique sequences, which in turn represent 128,000 total sequences; the last column also shows the total sequences, not just the unique ones. That is why providing the count table here is nice: it gives you more accurate information. Okay, our tool is finished now. I'll just quickly show you that it looks the same: if we go to the log file output and scroll to the bottom, we see the table I just showed you from the training material, so that's good.

After this we can do a little more cleaning. Like I said, we want to take out reads that don't align to the expected position. You saw in the align file that we have all these dots everywhere, so we will also cut out the columns that have a dot in every read at that position, just to make things a little easier; that is done using the filter.seqs tool. screen.seqs, which we used before for data cleaning, can also clean alignments, so we will use that too. Then the next step in this section is pre-clustering: we are going to look at all the reads and merge together the ones that are very highly similar. Anything within the range of sequencing error, say two reads with only a single base difference, is likely a sequencing mistake rather than a biological difference, so with pre-cluster we say that sequences this similar will be treated as one unique sequence. The last step will then be chimera removal. I mentioned this in the slides, if you watched those: something can go wrong during PCR amplification whereby you get two half sequences that join into a hybrid sequence. A chimera in Greek mythology is a creature that consists of parts of multiple creatures, for example the head of a lion on the body of a bird.
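The column-cutting idea of filter.seqs can be sketched like this: a simplified illustration that drops every alignment column which is a gap character in all sequences.

```python
def filter_columns(aligned):
    """aligned: list of equal-length aligned sequences, '.'/'-' as gaps."""
    keep = [i for i in range(len(aligned[0]))
            if any(seq[i] not in ".-" for seq in aligned)]
    return ["".join(seq[i] for i in keep) for seq in aligned]

filter_columns(["..AC-G..", "..A-TG..", "..ACTG.."])
# -> ["AC-G", "A-TG", "ACTG"]: the all-gap flanking columns are removed
```

Columns where at least one read has a base are kept, so internal gaps that carry alignment information survive; only the uninformative flanks and shared gap columns disappear.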
creatures for example the head of a lion and the body of a of the bird something like that and in PCR it's a sequence that actually comes from two separate sequences so this can affect your downstream analysis so we have to clean these out too so those are the steps that we will do in our next workflow so again our same process copy the link to this workflow by right clicking or you can click on it to download it and then we upload it want to copy this link location go back to galaxy go to workflow at the top import button at the top give it the URL if you've downloaded the file you will use not this top box but the next one and you can browse your computer for the file make sure you give it the red thing so again this one ends in dot ga and workflow to data cleaning and just hit import let's see we have workflow to here and we have a quick look again to see what this workflow does zoom out a little bit here at the bottom takes the line sequences in the count table it does a cleaning screen that seeks uh summary is always to get some information it does some filtering so those alignment symbols it'll remove it'll do some pre-clustering so a highly similar sequence it'll cluster together it'll look for chimeras and remove some files so to remove anything that found to be cluster together and then we do another summary away at the end of that so let's run that it's time by hitting play again on the right run the workflow and it should be pretty self-explanatory we're gonna send the results gonna keep them in this history for the aligned sequences make sure you give the output from align.seqs and also the latest count table and we run so I'm gonna give this a minute to run if your analysis is finished see if you can answer these questions so try to look at how many chimeras were detected and how many sequences remain after all these extriculating steps so remember we had about 16 000 something unique sequences before we went in so just see how many this chimera 
filtering and this filtering on badly aligned sequences how much we removed in those two steps with the answers just as soon as my files are finished okay so my jobs have now finished so let us look at the outputs so remember that we started with screen.seqs to remove any sequences that didn't align well so let's go down to those okay so we can find out here how many were removed and for what reasons so if we go to the bad.acnos output of this tool we can see that there are 128 lines here so that means that 128 sequences were removed at this step because they didn't align at the expected location so not over the v4 region and if we click the icon here we can again see the read names and the reason so maybe the ending alignment place was unexpected or the start was was too early or too late and we also in this step filtered for homopolymer stretches so that we we know that our v4 region does not have more than I think five homopolymers a stretch of homopolymers of five so anything over eight we have filtered out here for extra cleaning then we did at the filter.seqs so I will show you the output of this this just removed made a little bit more readable it removed the entire section at the beginning of the alignment that was just dots because that's the the region before the v4 region so this basically just cut out the v4 region for us and then we again obtained our unique sequences if we look here we now have 16 000 unique sequences and if we want to find out how many these represent we could do another summary.seqs the next step was to do the pre-clustering so we take all the sequences and try to identify pairs that differ with one or two bases at most and we're going to assume that these are sequencing errors not a biological difference so we're going to assume that these are the same and we're going to generate a consensus sequence from those and we can see now if we expand this one this was a major reduction step so here we see that we ended up with 5,720 unique 
sequences down from 16 000 before okay we can examine how many total sequences those 5,720 represent by clicking uh checking the summary.seqs so if we check this output and scroll down to the bottom we can see that those number of unique sequences in total represent 128 000 total sequences the next step was to um detect these hybrid sequences these chimeras you can see here how many there were if you click on the vsearch.acnos output of that tool you see that now we have 3441 uh unique sequences left after chimera detection and if you would like to know how um how many that is um the remove.seqs step actually removed those sequences so the chimera only does a detection chimera tool and the remove.seqs tool removes the the found chimeras so look at the latest summary that.seqs that's labeled after chimera removal to see how many total sequences were chimera so we see that now instead of the 128 000 we had before the total sequences are now at 118 000 0 9 1 okay so now our data is optimally clean uh and now we can move on to uh probably the more exciting part of this tutorial we can go classify our our sequences okay for taxonomic classifications there are a number of different approaches you can use here i am going to start the workflow first and then i will explain a little bit about each of the steps so we're going to import the workflow again we're going to scroll down to the section removal of non-bacterial sequences um copy the workflow url go back to your galaxy click workflows at the top import paste the url and hit the import button it's appear at the top of your list and we're going to run and ask us for four uh input data sets so we want the latest um festa and countable these are from the remove.seqs tool and for the classification we're going to give it um the reference data so this is from rdp we're going to give it a reference taxonomy and reference a festa file so the taxonomy should be the only option you see and for the fourth one we're going to 
scroll all the way back and find the reference file whose name ends in pds.fasta — trainset9, underscore some number, dot pds dot fasta — and run the workflow. Okay, while it is running, let's see what these steps are actually doing. Like I said, for taxonomic classification there are several approaches you can take: different tools you can use, different reference databases. In this tutorial we will use a Bayesian classifier via the classify.seqs tool from mothur, and mothur also provides a training set based on the RDP, the Ribosomal Database Project, which we'll use as a reference taxonomy. If you would like to read a little more about different methods of taxonomic classification and their effects, this background box has a couple of links to some nice papers. Here you see an overview of several different approaches that all lead to taxonomy assignments, and of popular reference databases. One thing I wanted to highlight is that the choices you make here, as with every tool, affect your outcome, your results. This paper did a nice comparison of different methods: each of these panels is a different method, each graph is a different dataset, and on the x-axis are the different regions that were sequenced, the different V regions. You can see that the different choices you make can affect your results. For example, in this mouse gut dataset, if you use Greengenes you identify mainly two different taxonomies (this is relative abundance), but if you use a different approach you find much more richness. In the same way, depending on which region you use, even within one method you will find different results: with this region here you get a lot of this purple group of bacteria, but with this other region you don't identify those. This is all just to say that these choices matter, and you should understand your sample: depending on your type of sample, the question you want to ask, and the region you sequenced, different approaches may be better for you. There's a nice discussion about this in the paper linked here as well.

Okay, so after we do the classification we might still have some non-bacterial sequences identified: there might still be some 18S gene fragments in there, or some 16S fragments from Archaea, or chloroplast or mitochondrial material might have survived. These can now be identified after this classification step, so we will also remove those. Mine is still waiting to be run, I'm just waiting my turn, but after that is done we will look at the results. When yours is done, see if you can answer this question: how many non-bacterial sequences, ones that survived all our previous cleaning steps, did we remove? See if you can find both the number of unique sequences and the total number of sequences they represent that were removed because they were determined not to be bacterial in origin.

Now that my tools have finished, let's look at the outputs. We ran the classifier, and if we look at the taxonomy output we see that each of the reads has now received a classification: it's been determined that this one is kingdom Bacteria, phylum Bacteroidetes, and so on down to the genus level, and the number in brackets is a confidence score. This is done for every unique sequence, so now we have a taxonomy assigned to each of our reads. Based on this we might find some non-bacterial species, and since for our analysis we are not interested in those, we are going to remove them, which we did with the remove.lineage tool. One of the questions in the tutorial was how many we removed here; to answer this you can look at the summary.seqs output again, before and after. One nice Galaxy feature, which you may or may not have used already, is the Scratchbook: at the top you see this sort of grid icon, and if you click it you can open multiple files side by side to compare them. We would like to do that now with the summary.seqs from before and after this classification and removal of non-bacterial sequences. So we click on that and open the two summary.seqs outputs: scroll down to the one from before, which is labeled "after chimera removal", press the eye icon, and we get a little window on top of Galaxy which we can move and resize as well (or should be able to). I'll scroll down to the interesting part, then also click on the latest summary.seqs output, and move the two around a little so they sit next to each other for easy comparison. I now have the new output on the left half and the old one on the right. You see that before this step we had 2,279 unique sequences representing 118,000 total, and after the step we have 2,259 unique sequences, so 20 unique sequences were removed, representing a hundred-something total sequences. That is how you can easily compare two files to find the answers.

The next step in the training manual is to calculate error rates based on a mock community. I will skip this today, but it's quite interesting to look at if you have the time, so I'll explain what this section does. A lot of the time when people sequence these types of samples, they will also co-sequence a mock community: an artificially made sample with known quantities of known bacterial species. If you sequence and analyze it in the same way as your samples, you can use it to get an estimate of how well your methods work. There is a nice paper linked here that explains this: on the left they had, I think, something like 20 different mock samples with 20 different species at equal proportions, and in an ideal world you would like to find exactly that back, all these species at the right proportions, but they found that depending on which region you use, the results can differ a lot. So it is always very important to really think about your sample and your research question, and find out which methods and which regions are best for you to sequence, because it does have an impact. In this tutorial we also had a mock sample like this in our dataset. To estimate the error rates in our method, we would take this mock sample and compare it to the sequences we know should be in there: if all of our reads map to that set of known sequences, we are doing very well, and if a lot of them don't match any of those, we know there might be contamination from somewhere else. In our case this was pretty good, so we can feel more confident about continuing the analysis with our other samples as well. But we will skip right to OTU clustering for our real samples.

So, the next step: OTU clustering. Remember that OTU stands for operational taxonomic unit, and basically this is clustering of similar sequences up to a certain threshold. We are going to cluster sequences that are 97 percent identical, which is roughly the variation you expect between different organisms of the same genus; it's a genus-level similarity. As I've said before, with 16S this is usually the best resolution you can get: many of these tools will go down to species level, but usually you can only trust the results down to genus level. Then you can make nice overviews like this, showing which clusters we have, with which taxonomy, at which proportions. That's what this next workflow will do for us. So again, import this workflow in the usual way: copy the link, go back to our Galaxy, Workflow at the top, Import, paste in the URL — it should be Workflow 5: OTU Clustering — and hit Import. Okay, I'm going
to start the workflow again, and then I will come back to explain what this workflow does. Here we see three different inputs at once, and we just need to make sure to give it the right ones: for the sequences we give it the latest one from remove.lineage, for the count table the same, the one from the same tool, remove.lineage, and for the taxonomy also the one we got from remove.lineage. That should be relatively simple, and we run the workflow. We can have a quick look at the steps in this workflow: if we go to the workflow menu, click on our workflow, and choose Edit, we get that nice overview of the different steps again. The first step just removes the mock sample from our collection, because we're only interested in our real samples; then we make the clusters, classify each cluster, and do some subsampling for the next step.

Once this workflow has finished we can examine the results again. The most interesting results here are the outputs from the classify.otu tool, so let's first look at the taxonomy output. You see this is a collection containing a list of files: we could have indicated that we wanted to do this clustering at several similarity levels, but we only chose 97 percent similarity, which is notated here in reverse as 0.03 — this means the 97 percent similarity threshold for the clustering. If we look at this data (I'm just going to zoom out a little), we see a list of OTUs: all our clusters are just given a number, OTU 1 through however many we found. We see how many reads were assigned to each cluster, so we can see this one is the most prevalent, and the next column shows what it was classified as. There are very many clusters here, 533 lines in total, so minus the first line, the header line, that's 532 different clusters. The other output from this tool is the summary file, and again if we look at that (you need to zoom out a little because this is a big file), we can see for each of the classified OTUs whether it was present in each of the samples, with the number of reads per sample. The question in the training manual is: can you find which samples contain Staphylococcus? To answer that, you search for Staphylococcus here; we see it's this line, and then we just check which samples have a non-zero number in that column. Zooming in a bit more: this was our line, and we see that the samples for day 141, 142, and 144 all contain this OTU, and the others don't.

Okay, so now that we have done our clustering and assigned taxonomies to our different clusters, we want to do some diversity analysis: we want to say something about how diverse our samples are. There is a nice description in the manual if you want to read more about this, but broadly speaking, diversity consists of a number of different concepts. Unfortunately it isn't a physical thing that you can measure directly, but there are different metrics, devised by different people, to describe diversity. Often it comes down to these: one is your species richness, a measure of how many different species you have in your community. Then there is species evenness: how similar in number those different species are in your community. Does one species make up 98 percent of your sample and the second one the other two percent, or is it more of a 50-50 type relative abundance? If one of them dominates, you would say the community is less diverse than if the species are more even in number. And then it also matters how closely related all those different species are: if you have a community with, say, 10 different species but they are all from the same genus, you would say that is less diverse than 10 different species that come from vastly different phyla, for example. These three concepts — the number of species, the abundance of the species, and their genetic relationships — make up this concept of diversity. But like I said, there are many different ways to describe and measure diversity; these are all names of different diversity metrics, each calculated in a different way to try to capture it, and it's not even a complete list.

Just to make sure you understand these concepts, try to think about this question from the training material: say we have community A, which looks like this, and community B — which one has more richness, and what can you say about the evenness? Take a minute to think about that, and then we will come back to the answers. So here is the answer: richness, remember, is the number of different species, so we can just count them. In both of these we have four — yellow, red, brown, and purple — so the richness here is actually the same. The difference comes in evenness: in community A the yellow mushrooms really dominate and there are only a few of the others, while in B it's more evenly distributed. Typically we would say community B is more diverse because of this difference in evenness. But even if we had two communities with similar richness and similar evenness, the genetic relationship between the species, like I said, might play a factor in which one we label as more diverse. For example, this shows a phylogenetic tree, where things close together on the tree are genetically similar. If you look at blue, it has these, say, six different species, but they all sit in the same part of the tree, while this red one is more diverse because it has some species in the top half of the tree and also some in the bottom half. So we would say that red is more diverse because it is more genetically diverse.

Okay, so broadly speaking there are two types of diversity. There's alpha diversity, which speaks about the diversity within one sample, one community, like we have been doing so far, and there's beta diversity, which says something about comparing different communities to each other. One thing that is helpful for estimating alpha diversity is to make rarefaction curves. Basically, we plot the number of OTUs against how much of the data we use: we subsample the data and ask, what if I only used 10 percent of my data, how many OTUs would I find? What if I only used 50 percent? And so on, all the way up to 100 percent. What we would like to see is a curve like this green one, which levels off at the end: that would indicate that even if we sequenced more, we wouldn't necessarily find more OTUs, meaning we have captured the full richness of our community. If, however, we see a line more like this one, which is not leveling off, that would indicate that had we sequenced deeper, we might have found more OTUs, more clusters, more species in our sample — meaning we did not sample enough. So this is also a nice way to estimate whether your sequencing was deep enough to capture the full diversity in your sample. Again, there's more suggested reading here if you would like to know more about these concepts.

We are going to make this plot for our own data as well and do some alpha diversity; that is what this next workflow is for. So again we copy the link, go to our Galaxy, import the workflow, and then we are going to run it. All this asks for is a shared file, which is one of the outputs we made in our previous workflow. Let's have a quick look at this dataset: it shows us, for each of the samples and each of the different similarity levels, how many reads were assigned to which OTU. It's a very big table, but it has all the information about the clustering in it; that's the important thing. So we're
going to take this file, give it to the alpha diversity workflow, and just hit Run. Once these tools are finished we will have a look at the output again. First let's look at the output of summary.single: here we see different diversity metrics for each of our samples. sobs is the number of observed species, and Simpson is a very popular diversity metric; these are the different measures. By the way, if you would like more information about these different metrics and how they are calculated, there are links in the training material. The other output we generated is the rarefaction plot I mentioned, so let's see if we see the expected leveling off. Each line represents one of our samples, and on the y-axis is the number of OTUs we observe using different numbers of subsampled reads. What this algorithm did was take our dataset and ask: if I only had 10 percent of the reads, how many OTUs would I have found? So it subsamples down to 10 percent, calculates the number of OTUs, and does this for different values up to the total amount of data we have. We see that for most of our samples the curve has started to level off a little, but they're not exactly flat yet, so probably if we had sequenced a little deeper we would have discovered a few more OTUs; but at least the flattening is happening.

The next step is to compare different samples. That was all about per-sample metrics, and now we want to compare the different samples, from the different days, and see how the diversity compares between those time points. That is what the next workflow will do: we will create some diagrams, heatmaps, trees, and other outputs to visualize diversity between samples. Let's start it first and then discuss the outputs. You're pros at this by now: copy the workflow link and import it into your Galaxy, and then let's run it. It asks for two inputs: the shared file, for which there is only one option for me, and the subsampled shared file, where we select the shared output from sub.sample. There was an error here before, but it went away when I changed the file to the correct one. And that's it, let's hit Run.

This will take a minute or two to complete, but I can already show you the kinds of outputs we're going to get. One of the outputs will be a heatmap, with each of the samples on the y-axis and again on the x-axis, and the brighter the red, the more similar they are. We see here that this sample — let me zoom in to read it — day 4, is very different from the other ones, the late time point samples. We can look at this in high resolution in Galaxy itself later, and this is done for different calculators, different diversity metrics. For beta diversity, just as for alpha diversity, there are different metrics that were defined by different researchers and groups of scientists, and depending on which one you use you'll get slightly different results. Another thing we will have done in this workflow is look at the similarity between samples with a Venn diagram. Obviously this doesn't work for a large number of samples, but we did it for four of them; if you have a small number of samples you want to compare, you can use this. You see here that 76 OTUs were present in all four samples, and there were quite a few in each that were unique to that sample, so this can also give you a bit of a feeling for how similar these different samples are. Then we also have a tree view: a tree which indicates how closely related the samples are, on average, generally speaking, so it tries to cluster them by similarity. You see that this whole group, which is actually all the early time points, is clearly different from all the later time points — except day 0, which is perhaps a bit odd — but you can see that these are all quite close together, and these as well, which is approximately what we would expect to see.

Okay, let's see if Galaxy is ready... okay, we're going to wait a little longer. So those are the two main types of diversity: alpha diversity looks within one sample, and beta diversity across samples. There are many metrics you can use here; they all try to capture the same thing, the diversity of the sample. You see that Galaxy also nicely gives you a sort of progress bar: for jobs still to generate in a list, you can see that this one is halfway done, and if you look into it, you can see that some are done, two are still working, and one is queued. The ones that are done you can already have a look at: here are the Venn diagrams I mentioned before. Let me zoom in a little so you can read the tags better: you see that day 1 is very different from the later days. These are nice outputs to examine for your samples. I was going to wait until this is done so you can see the other outputs, but we already discussed them, so I'm just going to move on to the next step, which will also be the last step: some more visualization.

One really nice tool for visualizations is Krona. I already put the output in the manual here, and you can see that what you end up with in Krona is an interactive plot for exploring your sample. These are all the different OTUs and classifications that we found in our sample, and if we want to look more closely at, for example, the Firmicutes, we can double-click those; you see the pie chart changes, and you can zoom in on the specific things you're interested in — here we see 16 percent of these were Lactobacillus — and then you can go back up by clicking in the middle: back at the level of all bacteria, 100 percent, five percent of those were this group. So this is a really nice tool to explore your sample and the community structure. Let's make this plot ourselves; we're just going to run the tools directly this time, not a workflow. First we prepare our taxonomy file to be in a format compatible with Krona. We made a special tool for that called Taxonomy-to-Krona, so type that into the search bar and it will pop up here: "convert a mothur taxonomy file to Krona input format". This just rearranges the format of the data, the columns, a little, into what Krona expects. So let's run Taxonomy-to-Krona: for the taxonomy file we give it the output from classify.otu, but since this was a collection, we click the rightmost button here, the folder, so that we get the option of the taxonomy output from classify.otu. And that's it; this will just do some transformation of that file, so that afterwards we can run the Krona tool by choosing "Krona pie chart" — that's the one we want — telling it that we have a tabular dataset, which is what the previous tool makes, and giving it the output of Taxonomy-to-Krona as soon as it's finished. We see that this is also a collection with one element in it, so we again choose collection, select the Taxonomy-to-Krona output, and execute.
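The reshaping that Taxonomy-to-Krona performs can be sketched in a few lines of Python. This is a rough illustration, not the Galaxy tool's actual code, and the exact input and output layouts here are assumptions: it takes one mothur classify.otu taxonomy row (OTU label, size, semicolon-separated taxonomy with per-level confidence scores) and emits the count followed by the tab-separated hierarchy that Krona's text importer reads.

```python
import re

def taxonomy_to_krona_row(line):
    """One assumed mothur taxonomy row -> 'count<TAB>Kingdom<TAB>Phylum<TAB>...'.

    Input format assumed: OTU<TAB>Size<TAB>Bacteria(100);Bacteroidetes(100);...;
    """
    otu, size, tax = line.rstrip().split("\t")
    # Drop the per-level confidence scores like "(100)" and the trailing ";",
    # keeping only the taxon names down the hierarchy.
    levels = [re.sub(r"\(\d+\)$", "", lvl) for lvl in tax.strip(";").split(";")]
    return "\t".join([size] + levels)

row = "Otu001\t129000\tBacteria(100);Bacteroidetes(100);Bacteroidia(94);"
print(taxonomy_to_krona_row(row))
# prints: 129000<TAB>Bacteria<TAB>Bacteroidetes<TAB>Bacteroidia
```

Krona then simply accumulates the counts down that hierarchy, which is also why, as noted below, it can visualize any hierarchical data, not just taxonomy.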
I'm going to have a quick look at what we gave to Krona, what that looks like. It's a very simple table that just gives the count and then the entire hierarchy: kingdom, phylum, all the way down to species or genus. It's a pretty simple file, and because Krona does not care what the data actually is, it can visualize any hierarchical data. Okay, so when this is finished for you as well, have a look, play around with Krona, and try to see what percentage of your sample was labeled Lactobacillus; the solution is here in this box. Now, this Krona chart shows all your samples together; if you want individual plots per sample, there's also a section in the training materials for doing that as an exercise. So this is really a nice final output where you can explore the structure of your sample.

Another nice visualization tool is Phinch. This is a tool outside of Galaxy, but we can access it, or open it, from within Galaxy, and it will let us explore and compare the different samples in our browser. To do that we first have to make a biom file: this is a kind of standard file format across different metagenomics analysis tools, so we're going to use the make.biom tool to convert from the mothur file formats to this more standardized format. It asks us for a shared file, so we give it our latest one; for the taxonomy we again give it the collection from classify.otu; and it wants some metadata, which is another thing we uploaded at the very start. So scroll all the way down to the bottom, past all our FASTQ files, and find the file with "metadata" in the name, mouse.dpw.metadata, and execute. Let me give that a minute to finish.

Okay, once the make.biom tool is ready, let's click on it. There's a little bit of nesting of the files, but eventually you'll get there. This time we don't click the eye icon — we could, but then we would just see the text format. To do the visualization in Phinch: if you're on the EU server, you see this link that you can just click on; if you're working somewhere else, or you don't see this link, you can also download the file and upload it to the main Phinch server yourself, and there's a little description here of how to do that. I'm just going to click here, and you see that it's now being loaded into Phinch. I can view my data: I see my samples and the sequence reads, which are all the same because we subsampled to the same level earlier, as a sort of normalization for this comparison. Then we can proceed to the gallery to view our generated plots. Just explore this at your own leisure a bit. There's a bubble chart that shows you the different species present; these donut partitions are nice — oh, not available for this dataset — so let's go to the taxonomy bar chart. Here we see, for each of the samples, the composition of our sample's sequences. There is a lot to explore here: we can view different taxonomic levels, so we started out at the phylum level, where most of the sample is these two phyla of bacteria; if we go all the way up to kingdom, we see that 100 percent is Bacteria, because of course we selected for that and removed everything else; and if we go down to genus, you see the different compositions there. So just explore this a little; it's a nice tool for comparing your different samples, and if you have any questions about it, feel free to ask in the chat.

And this is the end of our tutorial, so just a recap of what we did today. We started by uploading our data, of course; we did a lot of quality control, based on length and on quality; then we did an alignment, and some more cleaning after that; we removed anything that wasn't bacterial, because we're not interested in those; we could assess the error rate based on our mock community (I skipped this, but you can go back and do it if you're interested); then we clustered our data into OTUs, classified these, and analyzed the different clusters for alpha and beta diversity; and finally we visualized with Phinch and Krona. The key takeaways of this tutorial are that 16S rRNA sequencing is very nice for taxonomic profiling: it is relatively cheap and it's well established. There are many tools, algorithms, and reference databases out there; just be aware that your choice of algorithm and reference database may bias your results. As with any analysis, quality control is a crucial step to getting good results, which is why we spent a lot of time on it in this tutorial. A mock community can serve as a control for your experiment and helps you assess the error rate. And Phinch and Krona are two very nice visualization tools for exploring your final results. If you want to do more reading, there are some references listed below, and as always, if you have any feedback about this tutorial, please fill in the form here — that is really about these materials; any other feedback about this video or about this week's course you can give in Slack. So thank you all for joining, and ask us all your questions in Slack. Bye!