Welcome everybody. My name is Saskia Hiltemann. I am a bioinformatician from the Erasmus Medical Center in Rotterdam in the Netherlands, and today I will be walking you through the 16S Microbial Analysis with mothur tutorial from the GTN. As you can see, we're going to do the short version of the tutorial; there's also an extended version. These two tutorials do the exact same analysis, except that in this tutorial we will run it as a sequence of small workflows, while in the extended version we run every tool manually and really go into the details of all the parameter settings. At the beginning of each section in this tutorial you can switch between the short and the extended tutorial, so if there are any sections you want a bit more detail on, I encourage you to click this link and look at the extended tutorial. But for this video we will use the short version: it's a little bit less clicking, and we can focus more on the concepts and the results. Okay, so this tutorial can be found in the Metagenomics section of the GTN. Here I will be using the US Galaxy server, but you can use whichever Galaxy server you want. And I do recommend that you access the GTN tutorials via this little hat icon in Galaxy, because that way you get some nice integrations where you can click on a tool name in the tutorial or workflow to load it directly into Galaxy, so you don't have to search for it in the sidebar. Okay, so in this tutorial we're going to follow the standard operating procedure, the SOP, for MiSeq data, as it was developed by the creators of the mothur tool package, the Schloss lab. And this is the overview of the analysis. We're going to start by getting our data into Galaxy. We are going to put it into a collection, to make it a little bit easier to work with larger numbers of samples. Then we'll start off with a quality control step: we want to make sure that our data is of high quality, that we clean it and remove any low-quality data, so that we proceed with only high-quality data. Then we will do an alignment step, and we will remove chimeras, so a little bit more cleaning based on this alignment. We will do a taxonomic classification, and then we can remove some more non-bacterial sequences, because we're only interested in bacteria for this question. We will use a mock community to assess the error rate of our method. We will do clustering, diversity analyses, and finally some visualizations with Krona. So this is a 16S analysis: we sequence the 16S ribosomal RNA gene in order to classify the organisms in our sample. The reason we use this molecule is that it is present in all prokaryotes, and it has highly conserved regions as well as highly variable regions, which is depicted here. We can use the conserved regions to target this gene across different organisms, and then we can zoom in on the variable regions to distinguish between different organisms. In this tutorial we will use the V4 region for this; you may also see other regions used, or combinations of regions, for classification. It's also a very established method: it's been around for a long time, there are huge reference databases, everything has been polished, so it's a pretty robust method. You can read up about this a little bit more via the links here. The data we are going to analyze also came from the Schloss lab, the creators of the mothur tool package.
And they describe their experiment as follows. They were interested in understanding the effect of normal variation in the gut microbiome on host health. So they collected feces from mice on a daily basis for a year, and during the first 150 days post weaning, nothing was done to the mice; they were just left to eat, get fat, and be merry. In this tutorial we will compare the first 10 days to the last 10 days. So here's the depiction: this is a lot of samples, so we follow one mouse, the first 10 days and then some days at the end. One more note about the naming scheme: all the files will be named like this when we import them. The first bit of the name, here F3D0, indicates female mouse number three on day zero. Because we did paired-end sequencing, each sample will have two FASTQ files: one with the forward reads, with _R1 in the name, and one with the reverse reads, with _R2. Okay, so that's our data. Let's get it into Galaxy first. There are several ways you can do that. The standard way: all our data is in Zenodo, so you can expand this box, copy the links here, and then paste them into the upload data tool here on the left of your Galaxy, under paste/fetch data. I'm not going to do it this way; I'm going to show you another way. By the way, make sure you have a fresh history that is empty. If you don't have that, hit this plus button first and give your history a good name. I'm going to call it "16S mothur". All right. Okay, so the other way you can get data for tutorials... oh, it did not save the name; let me type it again, "16S mothur", save. There we go. Okay, so the other way you can get data for tutorials into your Galaxy is under shared data. The big Galaxies provide, under Shared Data and then Data Libraries, all the input data for all the GTN tutorials. This is a little bit faster than importing from a URL, and it also doesn't count against your quota in Galaxy, so it saves you a bit of space too. Here you see there are all sorts of datasets on offer, lots of public datasets and reference datasets. We are going to search for GTN, and then you see this GTN material library. Inside this data library you will see that it's organized by topic, the same topics as in the GTN. So we are going to go to the Metagenomics section, and inside we will see all the tutorials from that topic that have data here. We're going to go to 16S Microbial Analysis with mothur. There may be different versions of the dataset; if the tutorial is ever updated with new data, you will get multiple versions here, so for today we're just going to use the one with the highest number. And then we see all the files, indeed named like F3D0_R1.fastq. We're just going to import everything. So we have data from the early time points and data from the late time points, and we also have some reference data: 46 files in total, as you can see at the bottom here. So let's add this to our history with this button at the top. You can choose to import it as a collection or as datasets, but we're going to import it as datasets, and then we will create our collection with only our FASTQ files, only our sample files, later. This will take a second to load. You can choose which history to send it to; by default it'll be your active history, which is usually what you want, or you can create a new history. So hit import, and then we can start analyzing.
Okay, so you should see this green box pop up when it's done importing, and then you can click there to go to your history, or you can always click this home button at the top. Okay, so you see it imported 46 files. Some of these are reference datasets that we'll use later for classification and alignment, and if you scroll down, you see the FASTQ files of our mice, and we also have the mock sample's two FASTQs. So we are going to take all our FASTQ files and put them in a collection, to make it a bit easier to work with large numbers of samples: we just analyze the collection once, and we don't have to repeat everything for every sample. We're going to filter our history for the FASTQ files to make this easy. You can do this by typing anything in this box, so we're going to type "fastq", and you see it filtered our history. Then we want to select all these FASTQ files: we click this checkbox icon and then choose "select all" on the right, and it tells us it has 40 selected. We do indeed have 20 samples, each paired-end, so 40 files; those are the mock sample and our real datasets. So we're going to take these and say: with these 40, we would like to build a list of dataset pairs. Then you get this collection builder window. It will try to automatically find the pairs based on the names. You see it doesn't find anything now, but that's because it's looking for _1 and _2. Depending on your sequencer or where you got your data from, there are different conventions: sometimes it's _1 and _2, and sometimes it's _R1 and _R2, like in our case. So you can just edit these values and then say auto-pair, and now you see that it has found the correct files that belong together, and it will name each pair with the sample name. That looks good. At the bottom we're just going to give this collection a name; I'm going to call it "mouse samples", but you can call it whatever you want. Then we hit create, and we can exit out of this selection. It also detected that we did something with these files, so it changed the view again, but we still have the filter active at the top here, so we can remove this fastq filter by clicking on the X, and now all our data is visible again. So you see that this is all our reference data, and our FASTQs have now been moved into this collection. If you click on it, it looks like one item in Galaxy and you can treat it as one input, but behind the scenes it's really a list of 20 pairs. If you click on one of these pairs, you see that each of them again has two files behind it, the forward and reverse reads. And if you look at one of these, you see these are FASTQ files. We can look at them in the middle panel; it's pretty standard: the read name, the sequence, and the quality scores.
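Just as a quick reminder of the format, a FASTQ record is four lines: the read name starting with @, the sequence, a + separator line, and a quality string with one character per base. Here is a made-up example record, not from our actual data:

  @M00967_read_1/1
  TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGT
  +
  CCCCCFFFFFGGGGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHH

Each character in the last line encodes the Phred quality score of the corresponding base in the sequence line.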
Okay, so we're going to go back to our main view with this button up here. Okay, so now that we've got our data into Galaxy, we can move ahead to our analysis, and we are going to start by doing some quality control. We also have a dedicated quality control tutorial, so if you're interested in these concepts, I recommend you look at that. This is usually the first step of any analysis, because you want to be sure that you are working with high-quality data, and that you throw away any noise, any low-quality data, before proceeding with the real analysis. One thing we can do in our case is combine the paired reads to create contigs, and we can also use this to improve the quality, and that is because we are targeting the V4 region. We know this region is about 253 base pairs long, and our sequencing reads are about 250 base pairs long. So we take this molecule, this V4 region, and we sequence it from both ends, 250 base pairs each, which means most of it will be overlapped between the forward and reverse reads, and we can use this overlap to improve the quality. What happens is depicted here in this image. Here you have the forward read, and you find the overlap with the reverse read; most of it is overlap, and there will be a few bases on either end that do not overlap. From these two reads you make a consensus sequence. Where there is no overlap, you just copy the only information that you have. But where there is overlap, for example here at this position, the forward read called an A with a confidence of 10; these are Phred scores, so the higher the number, the higher the confidence (a Phred score of 10 means a one-in-ten chance the base call is wrong). The reverse read agreed that it was an A, and was even more confident. So in the consensus contig we say that at this position we call an A, and we can increase our confidence because the two reads agreed. The same goes for the other positions where they agreed. And if there is disagreement, for example here the forward read called a T at this position and the reverse read an A, but the T was called with much more confidence than the A, then in our consensus contig we call it a T, but we lower our confidence a little bit because of the conflicting call in the reverse read. Sometimes it can happen that both calls are low quality; for example, if the T was called with a confidence of five and the A also with five, then we really don't know what to call this, so we put an N there, no call, and we call that an ambiguous base. We're really not sure what happened there. The tool that does this in mothur is called make.contigs. It will look at each of these pairs, find the best overlap, and then generate the consensus contig. So let's run this tool now. If you started this tutorial in Galaxy via the hat icon, you can just click on this tool name; otherwise, you have to search for the tool name in the tool panel and click it there. However you get there, make sure to start the make.contigs tool. As input, we are going to give it our collection of sample pairs, and the rest we will leave at the defaults. At the top it asks us how we want to provide these FASTQ files: if we only had one sample, we could just give it a forward and a reverse file, a single pair, but in our case we have multiple pairs, we have a collection of pairs, so we're going to choose the paired-collection mode. Then it detects that this is our only paired collection, so it automatically selects it, and we leave the rest at the defaults and just hit run. All right, so while that runs, we will move on.
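By the way, for anyone running mothur on the command line rather than through Galaxy: this step corresponds roughly to the make.contigs command below. This is just a sketch; on the command line you would describe your samples in a "files" file with one line per sample (sample name, forward FASTQ, reverse FASTQ), and the stability.files name here follows the mothur SOP convention, it's not something produced in our Galaxy history.

  make.contigs(file=stability.files)
  summary.seqs(fasta=stability.trim.contigs.fasta)

The summary.seqs call is the same kind of summary the workflows in this tutorial run before and after each cleaning step.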
The next step will be some data cleaning, and this is where our first workflow comes in. We can do some cleaning based on these contigs. We know that the region we targeted is about 250 base pairs long, so if any of our contigs are much longer or shorter than that, then we know there's probably something wrong, either with the sequencing or with finding the overlap. We're going to use the screen.seqs tool to do this filtering. We are going to say, for example, that any contigs longer than 275 base pairs should be removed, because they're probably not useful. We can also remove low-quality contigs, ones that had a lot of these ambiguous bases. We can set our own thresholds in this screen.seqs tool; we're going to allow no ambiguous bases at all, so we're going to be very strict. That will be our quality control. The other thing we will do is deduplicate sequences. Because of the nature of this experiment, where we target the 16S gene across all the organisms in our sample, and because we will have many of the same organisms with the same 16S sequence, we will have lots of identical sequences in our samples; that is totally expected and fine. But we don't necessarily want to continue our analysis with all of those. For example, when we align our sequences to the reference database in the next step, it doesn't make a lot of sense to align the identical sequence 1,000 times. We only need to do it once, and as long as we know that this one sequence represents 1,000 sequences in our sample, that it occurred 1,000 times, we can still draw all the same conclusions. So that is what the deduplication step does: first we get all the unique sequences, and then, in something called a count table, we remember how many duplicates of each unique sequence were found in each of our original samples. All right, so before we start this, we will have a look at the output of make.contigs, as soon as it's finished. So now that make.contigs has finished, we can have a quick look at the files it created. You see it produced this trim.contigs.fasta file, so it's no longer FASTQ but FASTA: just the sequences, with the qualities kept in a separate quality file. It also has scrap.contigs.qual and .fasta files: if it failed to produce a contig at all, if it didn't find any overlap, it discards those pairs and they end up in these files. Let's have a look at the FASTA file. You see that we started with a collection, but now we only have one FASTA file. What happened here is that all the sequences from all the different samples are now in one FASTA file, which is nice: it makes things easy to analyze, especially if you're not using Galaxy, since it's easier to handle one file rather than 40 different ones. But you still have to remember which sample each read came from, in order to say anything about the samples later in the analysis. And that is where the group file comes in. You see here that we have 152,360 total sequences across all our samples. These are the read names; it has also renamed the reads a bit so they're shorter. And this is the group file that it makes: a two-column file with the read names in the first column and, in the second column, the name of the sample each read came from. During the rest of our analysis, we will pass this information along: we give it the FASTA file and this mapping between reads and samples, so that we can use it in our analysis. Okay, so now we are going to start our first workflow to do this data cleaning. If you are using this GTN-in-Galaxy function, you can just click this button here. This is a nice new feature that will automatically import the workflow into your account and open the run menu. So just one click, and you get this, and it will ask you for the contigs.
So just be careful here: by default it selected the scrap contigs for me, but we want the trimmed contigs, not the ones that were thrown away. And also the group file. So we can run this workflow now. And just to show you: the scrap contigs FASTA is empty, so nothing was thrown away, and we are happy with that. Okay, so give this a moment to run. Actually, if you aren't using GTN-in-Galaxy, if you have the tutorial open separately, I can also show you how to import the workflow while we wait. Here I have the tutorial open in a separate browser window. If you look at this hands-on box for importing the workflow, you see that you cannot click the one-click button here, but below it you see the workflow link. You can right-click and copy this link, then go into Galaxy, go to your workflow menu at the top, where you see the list of all the workflows you've used before, and there is an import button at the top right. You can just paste the URL and hit import to import your workflow. For the rest of this tutorial I will use the one-click button, but if you have a different setup, this is the other way you can get these workflows into Galaxy. So while we are here, and while we are waiting, I will show you how you can see, for each workflow, what happens behind the screens. We go back to our workflow list. Once you've imported a workflow, if you want to run it you just hit this play button; if you use the one-click option you don't have to do that, but if you've imported it, it will appear in this list and you can use play to get to the run menu. But if we just want to see what is in the workflow, we can click on the name here and click edit to get to the workflow editor. Here you see the overview of the workflow; we can zoom in and out in the corner here. Each of these boxes is a tool, and the ones on the left are the input datasets. You see that we take our contigs and our group file, and we run them through a tool called summary.seqs; this just gives some information about our dataset. Then we do the screening, the quality control, and then another summary, so we can easily compare our dataset before and after. After the cleaning, we do the deduplication step: we get the unique sequences, and then we also run the counting tool, which tracks how many of each of these unique sequences was found in each sample. This is always nice if you want a little bit more information, and if you click on one of these boxes, you can see on the right the parameters used for that step. Here you can see, for example, that the minimum length for contigs is 10 and the maximum is 275, so anything longer than that, or shorter, will be thrown away. And indeed the maximum number of ambiguous bases is set to zero, so it's quite strict. If you find after this step that you are throwing away too much of your data, maybe increase that a little to allow a couple of ambiguous bases; that's really up to you, depending on your experiment and your research question. Okay, let's get back to Galaxy. It's asking me to confirm leaving this page, because I moved some of these boxes around and didn't save the workflow, but that's fine for us.
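For reference, behind the scenes this cleaning workflow is essentially running the following mothur commands. This is a sketch; the file names are assumptions based on mothur's default output naming rather than what you'll see in Galaxy.

  screen.seqs(fasta=stability.trim.contigs.fasta, group=stability.contigs.groups, maxambig=0, maxlength=275)
  unique.seqs(fasta=stability.trim.contigs.good.fasta)
  count.seqs(name=stability.trim.contigs.good.names, group=stability.contigs.good.groups)
  summary.seqs(fasta=stability.trim.contigs.good.unique.fasta)

So: screen out contigs with any ambiguous bases or a length over 275, deduplicate, build the count table, and summarize again.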
So, wait till the workflow is finished, and then see if you can answer this question from the tutorial: how many sequences were removed during this cleaning step, the screening step, and how many unique sequences do we have in our clean dataset at the end? If you want to try and find these answers yourself, please pause the video here and resume when you're ready to hear the answer. Okay, so once the workflow is done, let's have a look at the outputs. Remember that we did a summary before and after this screening step. You see that the screening produced two outputs. There's the good FASTA: these are the sequences we're keeping, and you see that we're keeping around 128,000 sequences; we started with around 152,000, so we've removed some. For the exact details, there is bad.accnos: these are the bad, removed contigs, and if you open that file in the middle panel, you can see more detail. It has two columns: the first one is the contig name, and the second is the reason why it was removed. You see here that this contig from sample F3D0 was removed because it had too many ambiguous bases, and this one, for example, had too many ambiguous bases and was also the wrong length, too long. So there you can see exactly what was removed. And if you want that summarized a little more, because this is a big file with thousands of lines, you can of course look at the summary output. For this, let's have a look at the log file output of summary.seqs. If you scroll to the bottom, you see it summarized: at the bottom is the total number of sequences we had before screening, 152,000. These quartiles give some characteristics. You see the lengths, for example: most contigs are around 252 or 253 base pairs, which is exactly what we expect for the V4 region. The largest contig was 502 base pairs long, so that pair probably did not overlap and assemble well, and the shortest one is 248. You also see that most contigs have zero ambiguous bases, but the worst 2.5% had six ambiguous bases or more, and the very worst one had almost everything ambiguous, so that is really one pair that did not overlap well; we'll throw that out. And we can compare this to the summary file from after screening. The nicest way to do that is using this scratchbook at the top of Galaxy. If you click this, it will stay yellow and get a check mark, and now if you click on a dataset's view icon, we'll do it for the summary.seqs logs from before and after screen.seqs, it opens in a little window on top of Galaxy. We do it again for the other one, and you can move these around so that you can really compare them side by side. We'll go to the bottom in both. We see that we started with 152,000 reads; after filtering we have 128,000 left. You also see that all of them now have zero ambiguous bases, and the longest contig is 270, no longer 502. So you see here the effects of the filtering, and we still have a lot of reads left over, so we are happy with this. If you threw away more than half of your reads here, maybe be a little bit less strict, or check that everything went okay with your sequencing in the wet lab. Okay, we can close these windows again and disable the scratchbook by clicking on it again. Let's see what else we have here: the unique sequences. Here we see that of these 128,000 sequences, there are only around 16,000 different ones; the rest are all duplicates. Again, that's very much expected for this kind of 16S analysis. And then there's this output file we can look at real quick.
Here you see it has remembered that this read is representative of all these other contigs that are identical to it. This is really also 16,000 lines; for some reason it's only showing the first few, I guess only the first so many megabytes. It remembers, for each of these unique sequences, or representative sequences as they're called, which other reads are identical to it. And this is summarized further in a count table. If we look at that, we see the breakdown of each of these representative sequences: how many times it occurred in total and how many times it occurred per sample. So for example, this representative sequence occurred 4,402 times in total: 370 times in sample F3D0, 29 times in F3D1, et cetera. During the rest of our analysis, we are going to use our FASTA file together with this count table, so that for any analysis we do on these unique sequences, we can map the results back to how many sequences there were in each sample. Okay, and with that we have finished our first data cleaning section, so now we can move on to sequence alignment. For sequence alignment we also have a dedicated tutorial, so if you want more background on this, I encourage you to check out that tutorial. We're going to start by aligning our sequences to a reference alignment. This isn't always done in these types of analyses, but the creators of the mothur package have a nice paper here where they showed it does improve your OTU clustering, so we're going to follow their recommendation. We're going to use the align.seqs tool to align our unique sequences to the reference. So we click on that tool. You saw that at the start we also imported this silva.v4.fasta file; that's the reference alignment for the 16S gene. It has been subsampled to only include the V4 region, to speed up our mapping, because we know our reads should only map to that region anyway; but you could also use the full reference there, that's fine. Again, it asks us what we want to align: the unique sequences, the FASTA output. Then it asks about the reference. Where it says cached reference: some Galaxies may have preconfigured references that you can use, but you can also change this to your history and provide your own reference. We're going to change this to the silva.v4.fasta that we imported. The rest we will leave at the defaults, and we will run.
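Again for command-line reference, this alignment step looks roughly like this in mothur itself (a sketch; the input file names are assumptions following mothur's default naming):

  align.seqs(fasta=stability.trim.contigs.good.unique.fasta, reference=silva.v4.fasta)
  summary.seqs(fasta=stability.trim.contigs.good.unique.align, count=stability.trim.contigs.good.count_table)

The summary.seqs call with the count table corresponds to the summary of the alignment we're about to run.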
Okay, so once that is done, have a look at the output and see what you see. You can do another summary after that as well, because if you run summary.seqs on an aligned file, it'll give you a little bit more information about where in the reference the alignment happened, and we can use this to filter some more: anything that, for whatever reason, did not map nicely to the V4 region, we can get rid of. Okay, so let's wait till this is done and have a look at the output. Okay, so after it's done, let's look at the align output here. You see this is again a FASTA file, but now you see all these dots, and if you scroll to the right, there are a lot of dots. You may think: what's going on here? But if you scroll further to the right, you will start to see some alignment; we have to get to the V4 region of the reference first, and then we see that our reads start to align. Of course, this is not a very nice format to look at, so there's also some summary information in the align report; we'll have a quick look at that. Here you see, per contig, the length of the contig and some information about how it aligned, like the score; here a perfect match was found, and some of them did not have a perfect match, but that's still okay. And you can see roughly where each alignment starts and stops. Okay, so based on this alignment, we should do a summary.seqs, so I'll start that. In the same way we got information about our FASTA file before, summary.seqs, if we provide it with an aligned FASTA file, will also give us some information about the alignment. And from now on we will always give it a FASTA file and a count table, so that if it removes anything at any point, it can remove it from both files and they stay in sync. So run the tool. While we wait, we'll just look quickly in the training material at the output we will get here. Here we see the number of unique sequences listed, and the total number they represent: the 16,000 unique sequences represent 128,000 total sequences. And you see that most of them align pretty well to the V4 region. We know that the V4 region lies between positions 1968 and 11550 in the 16S reference alignment, and most of these align pretty nicely over that region, but again you see some outliers that do not cover the region quite as nicely on either end. So again, we can use this to do some more filtering. You also now see this homopolymer column: this is the length of the longest homopolymer stretch that it finds. We know that in the V4 region you don't expect very long stretches of the same nucleotide in a row, and sequencers are notoriously bad at resolving these homopolymer stretches: they have a hard time, when they encounter something like that, confidently saying whether they saw five A's in a row, or six, or seven, or eight. So if we see too long a stretch here, that might just be a sequencing error, and again we will throw those sequences away so they don't confound our data. So yes, based on this alignment, in our next workflow we will do further cleaning up. This is what will happen: we will remove any reads that do not overlap the V4 region nicely, and remove any overhangs on either end that stick out past the V4 region, with screen.seqs and filter.seqs. We will also simplify the alignment file a little by removing columns that have a dot everywhere; remember, we're only looking at the V4 region. And again, any time we filter or clean up, we might be introducing new duplicate sequences, so we want to run the unique step again. For example, if we had two contigs that were identical except for the last base, and we now remove that last base, the resulting contigs are identical again and you get new duplication; so to avoid this, we run unique.seqs again. The next cleaning step here is pre-clustering: any two contigs at this point that differ by only one or two bases are more likely to be the result of sequencing error than real biological variation, so we're going to use this pre.cluster tool to merge them together and treat them as one. And then the last step in this second cleaning workflow will be chimera removal. Chimeras are sequencing artifacts that can occur during PCR. The word chimera comes from Greek mythology, where it's used for an animal that is a hybrid between two animals, for example the head of a lion on the body of a bird. And this can happen during PCR: something goes wrong, and as a result, two unrelated sequences form a hybrid sequence. We can detect these and remove them to improve the quality of our data. So that will be the next workflow.
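For reference, this second cleaning workflow corresponds roughly to the following mothur commands, following the mothur MiSeq SOP; the exact parameter values used by the Galaxy workflow are assumptions here, and current is mothur's shorthand for the most recent file of each type.

  screen.seqs(fasta=current, count=current, start=1968, end=11550, maxhomop=8)
  filter.seqs(fasta=current, vertical=T, trump=.)
  unique.seqs(fasta=current, count=current)
  pre.cluster(fasta=current, count=current, diffs=2)
  chimera.vsearch(fasta=current, count=current, dereplicate=t)
  remove.seqs(fasta=current, count=current, accnos=current)

So: keep only sequences spanning the V4 coordinates with no overly long homopolymers, strip the empty alignment columns and overhangs, re-deduplicate, pre-cluster near-identical sequences, detect chimeras with vsearch, and remove them.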
So let's click here on the data cleaning and chimera removal workflow. It wants us to provide it with the aligned sequences and the count table. All right, it looks like it already selected the right ones for me: the align file from align.seqs, and our most recent count table. Okay, so while that runs: when it's finished, see if you can answer these questions from the tutorial. See if you can find out how many chimeras were found, and how many sequences remain after this second round of cleaning. Pause the video now if you would like to find these answers yourself, and resume when you're ready to discuss the results. Okay, so let's have a look at the results. Let's look at the chimeras first, so come here to chimera.vsearch, which identifies the chimeras. Here it lists how many it found: 3,441 of the unique sequences were determined to be chimeras, and we can check in a moment how many total sequences that represents. Then you see that remove.seqs did the actual removal. And we again did a summary after all this, and before, so we can compare those two. We can use the scratchbook again and look at the last summary.seqs log file and compare: before this whole second round of cleaning, we had 16,000 unique sequences, which represented 128,000 total sequences, and if we look at the log file from the end and scroll to the bottom, we see that after filtering we now have around 2,000 unique sequences. That really sounds like a big reduction, from 16,000 to 2,000-something, but if you look at the total sequences that represents, you see that we only lost about 10,000. So a lot of these removed sequences were unique, they did not have duplicates, and that is expected for things like chimeras, which only occur once exactly that way, and for reads that were sequenced very poorly and other artifacts. So in fact this looks quite good. The other change you see between these two summaries is that the homopolymers have been filtered: the longest homopolymer stretch in our data is now eight, which is where we set the threshold in screen.seqs, instead of the earlier twelve. Okay, so now we have very clean data. Let's look, for example, at the pre-clustering output, just so you can see real quick what that looks like. This is again a FASTA file, still with some alignment characters, but we got rid of that long run of dots at the beginning, because those columns were dots for all the different contigs; this again simplifies and reduces the size of our data. Okay, I'm going to uncheck the scratchbook before I forget. Okay, so let's move on. Our next step is the taxonomic classification. We will do this using a Bayesian classifier, via the classify.seqs mothur tool, and we will use a training set that was also created by the developers of mothur and prepared in the mothur format; it is based on the RDP, the Ribosomal Database Project, which you can read more about here. Before we begin, a quick remark about taxonomic assignment: there are many different approaches you can use here, and there's a nice overview in the paper that is linked here. There are lots of different algorithms and lots of different approaches.
And there are also lots of different reference taxonomies that you can use: Greengenes, RDP, and NCBI are some of the most popular ones. The thing to remember about this is that what you choose here does impact your results, so you should carefully figure out what is most suitable in your situation. In this paper they also compare this nicely. They had three different datasets; each of these graphs is a different dataset, with different approaches and reference taxonomies. If we look, for example, at the mouse dataset: if you use Greengenes, this here in each panel is the relative abundance, you see that you really only identify two taxa, at pretty consistent abundance. On this axis, by the way, are the different V regions. So you see that if you use a different approach or a different database, you will get very different results; for example, here you find much more richness in the same sample. And if you compare left to right within one of these panels, you see that depending on which V region you sequenced, you are going to find different results. All this is to say that which method is best here really depends on the type of sequencing you did, the type of community you expect to find, and your research question. So do a little bit of research; there are some good resources here to really think about what is best suited for you. There's no one-size-fits-all solution, unfortunately. There's another nice discussion linked here as well. Okay, so we will now classify our contigs. And once we've done that, we can also do a final step of cleaning, because anything that is not classified as bacteria is not interesting for our research question, so we'll remove it. There might still be some 18S rRNA gene fragments, or 16S fragments from Archaea, chloroplasts, or mitochondria, that survived all our previous cleaning steps, but we can identify them now and get rid of them. So let's run our next workflow, the taxonomic classification. We will load it; it wants from us our latest FASTA and count table, so those will be the outputs from remove.seqs, where we removed the chimeras. And then it wants our reference taxonomy and reference FASTA, so we will give those as well. The remove.seqs FASTA and its pick count table were already selected for me, but double-check that. For the reference FASTA we're going to give it one of the files we uploaded at the start, the trainset9 PDS FASTA, and the matching reference taxonomy. Okay, and then we hit run on that. So again, wait until that's finished, and then see if you can answer this question: how many non-bacterial sequences were removed? And again, try to find both the number of representative sequences that were removed and how many total sequences those represent.
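While that runs, for reference: the classification and removal steps correspond roughly to these mothur commands (a sketch; the trainset9_032012.pds file names follow the mothur SOP convention and are my assumption for the exact names of the training set files we imported):

  classify.seqs(fasta=current, count=current, reference=trainset9_032012.pds.fasta, taxonomy=trainset9_032012.pds.tax, cutoff=80)
  remove.lineage(fasta=current, count=current, taxonomy=current, taxon=Chloroplast-Mitochondria-unknown-Archaea-Eukaryota)

The cutoff=80 is the SOP's bootstrap confidence threshold for the Bayesian classifier; the workflow's exact value may differ.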
Okay, so now we will have a look at the results. You can see that it did the classification and then removed some things. If you want to see exactly what we told it to remove, you can hit rerun on this remove.lineage step to find the exact settings that were used in the workflow, or you could of course open the workflow in the workflow editor and find it there. You see that in this tool you can provide some taxa for it to filter out; we removed chloroplasts, mitochondria, anything unknown, Archaea, and Eukaryota. Now let's also look at the taxonomy output, the tax summary. This file contains, for every contig, its assigned taxonomy: kingdom Bacteria, then phylum, and so on, all the way down to genus or species level. And this other output from classify.seqs summarizes that a little, so you can see, for every taxon, the counts split by sample. Now, the question was how many of these non-bacterial sequences were removed, so here we can again use the scratchbook and look at the summary before and after. Let's find the one from before the classification and removal, that summary.seqs log file, and compare it to our latest one, after the removal, side by side, scrolled to the bottom. We see that we started with 2,279 unique sequences, and that went down by about 20, so around 20 of them were non-bacterial; in total, the sequences they represented went from 118,091 down to 117,999. So again, not a lot of difference here, but we cleaned our data a little bit further. Okay, so in the next section I will show you how to use a mock sample to assess the error rate of your method. This is an optional section, so if you're not so interested in this, you can skip to the next section without any problems; nothing we do in this section will be used in the rest. But if you would like to see how this works, stick around for this section. A mock community is an artificial sample that you sequence along with your real data in order to assess the error rate of your method, of your experiment. You construct this mock community yourself, so you know exactly which organisms are present and at which proportions, and of course, after your sequencing and your bioinformatics analysis, you hope to find back exactly what you know was in there. But if you don't, you at least get a measure of how close you got. Usually you put species in there that are similar to what you expect to find in your samples. There is a nice paper on this, linked here in the tutorial as well. This is an example: they had this mock community on the left, with 20 members at equal proportions, and then they used different methods, sequencing different variable regions with different protocols. And you see that they got pretty different results: most of these detected most of the organisms, but the proportions at which they found them varied significantly. This is again to say that doing something like this, maybe trying different methods to see which one gets you closest to your known sample, can really help improve your quality, if you can sequence a mock along with your samples. There's some further reading on this if you're interested. So, in our tutorial data we also have one of these mock samples. We are going to extract the mock sample first and then do an analysis on it, to see how close we get to the expected result. We're going to use get.groups for that. Just to show you quickly: back in our original set of samples, you see we have all our real data in pairs, but we also have this mock sample, our artificial sample. We're now going to extract, from our FASTA file and count table, all the data for this mock sample, so that we can analyze it further. So, get.groups: we are going to give it the count table and FASTA from our latest step, where we removed the non-bacterial species, remove.lineage. We're going to tell it to extract the mock sample, and we want to see the log output file. So, let's see: we have here the count table from remove.lineage, and our FASTA file.
So it's just loading this file, and it's also looking inside the file at the group names, so that it can let us choose here which group, which sample, we want to extract; we want to select mock here. You can select multiple groups if you ever need to. Then we also give it the FASTA from remove.lineage, and the final thing, at the bottom, is that we say yes to the log file. Again, it's reading inside the file to offer us the groups, so this may take a little bit longer, but once it's there, we will select mock and run the tool. Okay, so in the log file you will see how many sequences were selected: 58 unique sequences, representing 4,046 total sequences. Next we will use a tool called seq.error, which is made for exactly this sort of error estimation. We're going to give it the FASTA and count file of our mock community, and as a reference we're going to give it the sequences that we know we put in there; then it can estimate how close we got. Okay, the previous tool has only just started, but we can already set up this next tool. So let's click on seq.error. We are going to give it the FASTA file from get.groups. We again want to give a reference file from our history; let's double-check: we scroll down and choose the HMP_MOCK reference, the known mock sequences. And I think the count table was the last thing: yes, the count table, also from get.groups. And we run this tool. While we wait, I will show you quickly what this mock reference looks like. I've still got my scratchbook on; we'll turn that off. Oh, actually, we can see it fine here. You see this is just a FASTA file with the sequences of the species we put into the mock. All right.
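For reference, the two steps we just set up look roughly like this as mothur commands (a sketch; the group name Mock and the HMP_MOCK reference file name are assumptions based on the mothur SOP):

  get.groups(count=current, fasta=current, groups=Mock)
  seq.error(fasta=current, count=current, reference=HMP_MOCK.v35.fasta, aligned=F)

seq.error compares each mock read to the known reference sequences and reports the mismatch rate, which is the error rate we're after.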
So we're just going to wait for that to finish, but in the meantime we can already have a look at the tutorial to see what will come out here. Once this is done, we will see some output like this in our log file, showing the overall error rate that was detected. It's a pretty low error rate, so this gives us some confidence in our method. As the next step, we will cluster these mock sequences into OTUs. OTU stands for operational taxonomic unit; they're basically placeholders for taxa: you cluster similar sequences together. You can set different thresholds here; a 97% identity threshold is often used, and this is a rough proxy for genus-level similarity, but you can also cluster at a higher identity if you want. So we'll use these as a sort of proxy for different taxa: just groups of similar sequences. We'll do this for our mock community, and then we can do some more error control on it. We also want to reassure ourselves that we have captured the full diversity of our sample, and one of the things we can do for that is generate rarefaction curves. What this really means is that we subsample our data and ask: say we didn't take all our data, but only used half of it, how many of these OTUs would we have found? What you would ideally like to see is something like this green graph: using all our data we found this many OTUs, and if we had used only half our data, we would have found the same number. That tells you that if we sequenced more, we're not really going to find any more diversity; we've really captured everything we can. If you see something more like this blue line, it's starting to level off, so that's good, but if you sequenced a bit more, you might still find a couple of additional OTUs, a couple of additional organisms. And if you have something like this brown line, then you really have not begun to capture the full diversity of your sample yet: if you sampled more, you would expect to find a lot more different OTUs, different species. So, we are going to generate these rarefaction curves, so that we get a feeling for whether we sequenced enough to capture the full diversity of our sample. Okay, this is done in the next workflow, so click on Mock OTU clustering. seq.error hasn't finished for me yet, but that's okay, we can already start this new workflow. Again, let's have a peek to make sure we do this right: we're going to give it the count table and FASTA from get.groups. That one is right, and this one, the get.groups FASTA, too. Okay, so let that run, and then try to see if you can find out how many OTUs we discovered in our mock community. Here we use the 97% identity threshold, which will be displayed as 0.03: a sort of difference metric, the inverse of the identity. All right, so pause here if you want to find the answer yourself, and resume when you want to hear the answer. Okay, so let's have a look at the results once it's finished. The question was how many OTUs we find in our mock sample. We can find that in different files; let's start here, after clustering, with the OTU list. Normally this displays as a proper table, but now it doesn't; still, this is the header line, and we see that we have 34 OTUs, just numbered 1 to 34. And here you can see which contigs map to each OTU. We have one line here, at this one similarity level; it's also possible to do multiple similarity levels at the same time with mothur, but we only care about the 97% identity, that's the standard. So 34 in total. And you see they're ordered from largest to smallest: OTU 1 has the most contigs as part of it, down to a single contig for these last ones. Let's also look at the rarefaction output. Here we did the subsampling; you can do this for different measures, we'll explain more about the different diversity measures later. But here you see: if you use all your data, like at the bottom here, we have 34 OTUs; if we use 3,000 reads instead of 4,000, we have 30 OTUs. So you see the leveling off is starting, but it's still a fairly steep line here. We can also plot this if we want, and we will do that for the real dataset. So we now see the number of OTUs, and we can compare that with how many organisms we put into our mock sample; let's look quickly, we can just expand it. In our reference we had 32 sequences, 32 different organisms, in our mock sample. So we get pretty close to that; not exactly, there is some difference there, but that's quite common and it's still quite good. So with that we have sort of convinced ourselves that we have a good method here, both in the wet lab and the dry lab. So now we can redo this OTU clustering, but for the real samples; that will be the next section.
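As we move to the real samples, for reference: the OTU clustering we did for the mock, and are about to repeat, corresponds roughly to these mothur commands (a sketch per the SOP; for the real data the workflow first drops the mock group, and the 0.03 label is the 97% identity level):

  remove.groups(count=current, fasta=current, taxonomy=current, groups=Mock)
  dist.seqs(fasta=current, cutoff=0.03)
  cluster(column=current, count=current)
  make.shared(list=current, count=current, label=0.03)
  classify.otu(list=current, count=current, taxonomy=current, label=0.03)

So: compute pairwise distances between the unique sequences, cluster them into OTUs, build the per-sample OTU table (the shared file), and assign each OTU a consensus taxonomy.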
And I'm just going to explain this again for those who skipped over the mock community section. We are going to cluster our contigs into operational taxonomic units, OTUs. These are just clusters of very similar reads, very similar contigs, that we use as a proxy for taxa. We can do this at multiple identity thresholds; you will commonly see a 97% identity threshold, so clusters that have at most 3% difference. This is often used as a placeholder for genus-level similarity, but you could also require more sequence identity to go to something like species or strain level. Just remember that it is a little bit questionable how precise you can get with 16S data, because it is a highly conserved gene, and you're only looking at one gene; usually it's only really reliable up to genus level. If you want more resolution there, you should look into methods like shotgun sequencing to get more data and more power. So, we're going to cluster our contigs into groups of similar sequences, and then we are going to assign a taxonomy, another classification, to those clusters. The next workflow will do that. It will first remove all the mock sequences, because we only care about the real data from now on. So we're going to start this workflow, and we are going to give it our latest FASTA file and count table and the taxonomy that we determined, all from remove.lineage. And be careful here: if you did the mock section, the defaults may not all be right; we want the FASTA and count table from remove.lineage, and the taxonomy from remove.lineage. And run. Okay, so this workflow clusters into OTUs, and it does a classification step on those OTUs, using the consensus taxonomy. And then in the next section we will use these OTUs to say something about diversity. But let's first wait until this workflow is done, and then we'll have a look at the outputs. So let's look at the outputs and see if we can answer this question: which samples contain sequences belonging to an OTU that was classified as Staphylococcus? You see here it made some new outputs: first it removed the mock sample, then it clustered and classified. So let's look at the outputs of this classification; I think we need the summary. That is a collection again; it only contains one file here, for 0.03, meaning the 97% identity level. You could have also asked mothur to do this for multiple levels at once, but let's look at this one. Here you see all the different taxa that were identified, and how many times each appears per sample. The question was about Staphylococcus, so we're just going to search for that: I'll copy it here and then find it in my browser with Ctrl-F. We see that it does appear, and you can see that it appears once or zero times per sample. To know which samples those are, we need to zoom out a little. Let's see if this is enough. We see that it appears, for example, in the samples from day 141 and 142, but not on day 1, and on days 144 and 145, but not in most of the other samples. So this could definitely be something that was introduced later in the mouse's life: in the first ten days there was no Staphylococcus in the microbiome, but after 150 days it had picked that up. So with this classification you can see things like that. Let's zoom back in. Then the other file that it made, real quick: this is just the same classification, but on the OTU level.
So, not mapped back to the samples, but for each OTU: the size of the OTU and its classification, so really kingdom Bacteria, then the phylum, for example Bacteroidetes, and all the way down to genus or maybe species level. So now that we have this classification, we can use it for a diversity analysis; that'll be the next section. There are different types of diversity analysis. First, a general remark about diversity: this isn't some constant that you can measure directly. When we talk about diversity, we usually talk about a combination of different concepts; it's explained a little in this background box. We have alpha diversity, which is about the ecological complexity within a single sample, and if you're looking to compare samples, you're talking about beta diversity. So diversity is a valuable tool for describing the ecological complexity of your sample, but it is really a combination of three concepts. You have species richness, which is simply the number of different species, of different organisms, in your community. Then you have species evenness, which compares the proportions of these different species: how many do I have of each? And diversity also takes into account phylogenetic diversity: how closely related are the different species in my community? So, a little illustration of this. Say you have community A here and community B; which one of these has the highest richness? Again, think about this yourself: pause the video and continue when you want to know the answer. So remember that species richness is the number of different species in the community. If you count them here, you see four different species, and here you actually see the same four different species, so these two have the same richness. Now ask yourself which one of these is more diverse; what would you say? You'd probably say community B, and that is because in A, most of the community is dominated by one species, this yellow one. So diversity, you see, also includes the proportions of the different species; that is evenness. You see that B is more even than A, because its species have more similar proportions. So those are two of the concepts that come into play, but then you might also think: well, it depends on how closely related these species are. So let's say you have these two communities, now shown on a phylogenetic tree. On the left here, say the red species are part of your community A, and in another sample, community B, you have the blue species. Which one would you say is more diverse? I would probably say the red community, because these species are less closely related to each other than in B: B has one, two, three, four, five, six different species in it, and A only five, but the blue species are all very closely related, while the red ones are more genetically different, so you would say the red community is more diverse. So when we talk about diversity, it is a combination of all these different concepts, and there's no single best way to quantify it, so many people have defined many different metrics, all named after very smart researchers who all had their own ideas about diversity. The mothur tools can calculate all these different metrics for you, and if you want to read more about them, there are good papers linked here.
And again, which one you want to use depends on your research question, your experimental setup, everything. So we will do alpha diversity first. We can use this to generate these rarefaction curves again; I will explain once more, for the people who did not do the mock community section. With rarefaction curves, you plot your diversity as a function of the amount of data you used. At the end here, using all the data of a sample: what is the diversity you capture, how many different species, how many OTUs do you find? And if you had only used half of your data, how many would you have found? What you would ideally like to see is a curve like this green one, which says: I do not find additional OTUs if I sequence more. It has already leveled off, so you have probably captured the full diversity of your sample. This blue one, however, has not quite leveled off yet, but it's starting to; here, if you sequenced more of your sample, you would probably identify a few more species, a couple more OTUs, but maybe not that many; you are almost there. But if you have something like this brown line, it really hasn't started to level off yet, so you've not begun to capture the full diversity; if you sampled more, sequenced more, you would probably find a lot of additional organisms. In that case you may want to reevaluate whether you want to redo anything, or sequence more, for example. This time we will get these rarefaction plots, and we will get some diversity metrics and plots based on those. So let's start this workflow. This time it asks for the shared file from make.shared. I haven't discussed that one yet: it is another mothur-specific format, and you can look at it real quick. It basically tells you, for all the OTUs, how many were found in every sample: for day zero, at this similarity level, how big each of these OTUs is, how many contigs map to each of them, and that repeated for all your samples. So this is similar to the count table, except OTU-based. Okay, so let's do that: we're going to run workflow number six, alpha diversity, and just give it the shared file, which has all the information about all our OTUs across the different samples. Okay, the make.shared output, that's fine; run workflow. While that runs, we can already get a sneak peek of what we will see. We get the rarefaction data for every one of our samples, and we get the diversity metrics at different subsampling levels. In the rarefaction output, sobs is the observed richness, so simply the number of OTUs. You can get different metrics here, and you can read about them in more detail if you want; the inverse Simpson index, for example, is a probability-based diversity metric. In a file like this, you can also ask mothur to add more metrics; the whole list I showed you, you can have mothur calculate all of those for you. And here it's for all the samples; sobs is the simplest one, the richness, the number of OTUs. Some things we can observe from this file: we can see whether there are any big differences between, for example, the early and late time points. Here you see nothing really obvious; there's no big difference in diversity between mice on day zero and day 150. We also see from the coverage that most of the OTUs are well covered by multiple sequences, which is also something we like to see.
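For reference, the alpha diversity calculations in this workflow look roughly like this in mothur (a sketch; the list of calculators here is an example, not necessarily exactly what the Galaxy workflow requests):

  rarefaction.single(shared=current)
  summary.single(shared=current, calc=nseqs-coverage-sobs-invsimpson, subsample=T)

rarefaction.single produces the subsampled OTU counts behind the rarefaction curves, and summary.single computes per-sample metrics like the observed richness (sobs) and the inverse Simpson index.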
And then we also see from the coverage that most of the OTUs are well covered by multiple sequences, which is also something we like to see. If you want more information about the diversity metrics, the mothur wiki has good explanations of what they all are and how they are calculated. We could do more analysis, for example a statistical test like ANOVA, to confirm this feeling that there is no significant difference between early and late, and maybe between male and female mice if we had that data, but that is beyond the scope of this tutorial; you could use the full dataset from the mothur SOP if you want to look into those things.

So after alpha diversity, let's see if this has finished. Almost; no, it still needs to do some more, so I am just going to wait for it to finish, and then we can quickly look at the real files, although we have already discussed their contents. Okay, now that it is finished, we can have a look at all the output files. We did the subsampling to get the rarefaction, and we calculated the diversity metrics, so we can have a quick look. This is just the observed OTUs: for different subsampling levels, how many OTUs do you find? Some samples had more sequences to begin with than others, so those have an NA here. We can plot these, which we did here, and if we look at that we see... okay, the fonts are apparently not working on this Galaxy instance, so let's look at what it is meant to look like in the tutorial. The number of OTUs is plotted against the number of sequences, and we see this leveling off happening, or at least starting to happen, for most of our samples, which is reassuring. If you sequenced a bit more, you would probably find a couple of additional OTUs for some of these, but the bulk is already captured.

Okay, next up: this was all alpha diversity, where we look at a single sample and want to say something about the diversity within it. If we want to compare samples or groups of samples, for example to ask whether the late time points are more diverse than the early ones, then we need beta diversity. Here you have exactly the same situation: there is not just one beta diversity metric but a whole list of different ways to calculate it, and there are some nice links here if you want to read up on the details. So we will calculate beta diversity, and then we will try to plot it a little so we can see whether we find differences between our samples.

To do this, let's start the final workflow of this tutorial, the beta diversity analysis. We load it into Galaxy and give it the shared file from make.shared, our sample shared file. Okay, make.shared... and that is not the right one, so: sample shared, and run. While that runs, I can already show you the results, because they are in the tutorial. Here you see which beta diversity calculator was used, and you can read up there on how exactly it is calculated. Then we make heatmaps like this one, with all the samples plotted against each other on the two axes; the darker the color, the more similar two samples are. You can see here that this sample, I think it is day four, is very similar to, I think, the other early time points; we can read this off properly once it has been generated. But you can already sort of see clusters: the late time points cluster together, except for this one early one, and the early time points also cluster together a little bit.
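To give a feeling for what such a beta diversity calculator actually computes, here is a minimal Python sketch (my own illustration with made-up count vectors; mothur computes these for you, for example via its dist.shared command) of two common dissimilarities: a membership-based one in the Jaccard family, which only asks which OTUs are shared, and the abundance-based Yue & Clayton theta (thetayc):

```python
# Two hypothetical OTU count vectors, aligned on the same five OTUs (made up).
sample_a = [120, 30, 0, 50, 5]
sample_b = [100, 0, 10, 60, 2]

def jaccard_dissimilarity(a, b):
    """Membership-based: 1 - (shared OTUs / total observed OTUs); ignores abundance."""
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    total = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    return 1 - shared / total

def thetayc_dissimilarity(a, b):
    """Yue & Clayton theta: abundance-based, driven by the OTUs' relative proportions."""
    ra = [x / sum(a) for x in a]
    rb = [y / sum(b) for y in b]
    dot = sum(x * y for x, y in zip(ra, rb))
    return 1 - dot / (sum(x * x for x in ra) + sum(y * y for y in rb) - dot)

print(round(jaccard_dissimilarity(sample_a, sample_b), 3))  # ~0.4: two of five OTUs unshared
print(round(thetayc_dissimilarity(sample_a, sample_b), 3))  # ~0.08: the dominant OTUs agree
```

Computing such a dissimilarity for every pair of samples gives you the distance matrix behind the heatmap: a dark cell is simply a pair with a low dissimilarity. Note how the two calculators can disagree on the same pair of samples, which is exactly why the choice of calculator matters.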
And of course, if you use a different calculator, you will get different-looking results, so again, look into what is most suitable for your situation. Another thing we can do is represent this as a Venn diagram. Of course, this will not work if you want to compare lots of different samples, but for up to, say, four it can be quite nice. Here we have compared four samples, the first four days, and you can see that most of the OTUs are shared by all four of them. There are, let's say, between eight and eighteen OTUs that are unique to a single sample, and all the others are shared by two or three of these samples. Again, this can give you a feeling for how similar these communities are: this number in the middle that they all share is very high, so they are very similar; if it were only a handful, with very big numbers on the outside, you could say these are very different communities with little overlap.

And the final visualization here is in the form of a tree, where the most similar samples are clustered together. You see here nicely this clustering between the early time points on the one hand and the late time points on the other, so there was some difference between early and late, except maybe this very first day, which sits, yeah, very high up here. So this is another nice way to represent and compare the samples.

Okay, and then the very last thing we will do in this tutorial is make a nice Krona plot of our community, a really interactive plot that shows us the composition of our samples. We have this Krona tool in Galaxy, but before we can use it, we have to convert the taxonomy file that we got from the classifier into a format that Krona can understand and plot; I will show a rough sketch of what that conversion looks like below. There is a special tool for this in Galaxy called Taxonomy-to-Krona, so we are going to run that tool now. We give it our classify.otu output, and note that that is a collection, so we have to switch here from single dataset (it says it cannot find anything) by clicking on the dataset collection option; then you can select the taxonomy from classify.otu. That is all you need here, and then you just run this tool. And we can already queue the next one: we run Krona on these Krona-format taxonomies. Again, we need to change the input type first; I think it is another collection, so again we click on dataset collection and make sure we select the Taxonomy-to-Krona output. Galaxy is becoming a little slow for me, all the Americans must be waking up, I am in Europe myself. Yes, here it is: Taxonomy-to-Krona. Cool.

You can wait for that to finish, but the interactive plot is also here in the tutorial, so this is what you will get: the composition of your community, which you can view at the different levels. You see that 100% is kingdom Bacteria, and then you have a couple of different phyla. If you want to zoom in on one of those, you can double-click it and it will zoom in like that; you can look at the different levels and see the relative abundance of all these taxa, or go back up to Bacteria this way. This is a nice way to interactively explore your sample.
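In case you are curious what that conversion roughly looks like, here is a small Python sketch (my own approximation, not the actual Taxonomy-to-Krona tool; the file names are hypothetical, and I am assuming the usual classify.otu taxonomy layout of OTU, size, and a semicolon-separated lineage with bootstrap values) that turns a mothur taxonomy file into the tab-delimited text that Krona's text importer understands:

```python
import re

# Hypothetical input: a classify.otu taxonomy file with lines like
#   Otu001<TAB>1234<TAB>Bacteria(100);"Bacteroidetes"(100);...;
with open("stability.cons.taxonomy") as tax, open("krona_input.txt", "w") as out:
    next(tax)  # skip the OTU / Size / Taxonomy header line
    for line in tax:
        otu, size, lineage = line.rstrip("\n").split("\t")
        # Drop bootstrap confidences and quotes, e.g. '"Bacteroidetes"(100)' -> 'Bacteroidetes'
        ranks = [re.sub(r"\(\d+\)", "", rank).strip('"')
                 for rank in lineage.strip(";").split(";")]
        # Krona text format: magnitude<TAB>rank1<TAB>rank2<TAB>...
        out.write(size + "\t" + "\t".join(ranks) + "\n")
```

Feeding krona_input.txt to Krona then draws each OTU's reads into the nested rings of the interactive plot, with ring sizes proportional to the sizes we wrote in the first column.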
Now, note that this plot took basically our entire dataset, so you cannot compare by sample; this is everything together. If you want to do it per sample, there is a nice exercise at the bottom here: you can redo the classification step so that it gives you a taxonomy file per sample, and then again convert those to Krona format and run Krona. Try to do this yourself as an exercise; the full solution is given here. Let's see if the Krona plot is in here... yes: if you do it this way, you get one output with several Krona plots in it, and you can switch here between the different samples to see how the composition of the gut microbiome changed. So this is a very nice tool for interactive exploration.

Okay, so now you have finished the mothur standard operating procedure for MiSeq data; here is another overview. Well done! The main things to remember are these. There are many different choices to make over the course of this analysis that can influence your final results, so make sure you understand the differences and what is appropriate for your situation. Quality control, as with any analysis, is very important, so do not skip over it: make sure you clean up your data and that it is of high enough quality, and if it is not, maybe consider redoing some of the sequencing or doing additional sequencing to increase the power of your experiment. You can sequence a mock community together with your samples; for example, labs I have worked with did this as a quality-control step with every batch of patient samples they sequenced. And you can explore alpha and beta diversity and visualize your communities with tools such as Krona. Thank you all for listening, and let me know if you have any questions.