Hello everyone, I'm Delphine, and today I'm going to guide you through the Using Dataset Collections tutorial to explore what we can do in Galaxy with collections. The big advantage of using dataset collections instead of individual datasets is that, rather than manipulating 10 datasets at every step, you manipulate only one object throughout your analysis. To get started, we're going to download a set of files produced by a sequencing experiment. Copy the URLs listed in the training tutorial, go to your favorite Galaxy instance, create a new history, and name it however you want; here, "collection tutorial", then save. Next, click Upload Data, use the Paste/Fetch data button, and paste the URLs that you copied from the training material. To be sure that Galaxy detects the right format, we're going to specify which type of file we are using: here, fastqsanger.gz. Start the import and close the window, then wait a few seconds for the data to load. Now that the datasets are downloading, we're going to organize them into a collection; specifically, into what's called a list of dataset pairs. Why pairs? Because we notice that the file names end in underscore one and underscore two, which in sequencing experiments indicate the forward and reverse reads; both files are generated for each sample. So each file name is made up of the name of the experiment, the name of the sample, and finally the forward or reverse read indicator. To organize these datasets into a collection, select them using the Select button and Select All. Then, for all eight selected datasets, choose Build List of Dataset Pairs. Galaxy automatically detects the naming convention, but so you can learn how it's done, we're going to pretend it didn't: unpair everything, clear the filters, and fill them in manually.
For the forward reads, we have eight unpaired files, and we're going to specify that the filter for the forward reads is underscore one. In the same way for the reverse reads, we have eight unpaired files, and we're going to specify that underscore two indicates the reverse reads. You can see that, as you fill in these fields, Galaxy automatically detects that two files with the same name and different suffixes are supposed to be paired together. So we're going to pair all four pairs of datasets and name our collection M117-collection, after the name of the experiment. We're going to hide the original elements, because we're not going to use them individually in the rest of the analysis; if you need them for other purposes, you can decide to keep them visible in the history. Click Create Collection, and you can see that instead of eight datasets we end up with only one object to manipulate. When you click on the collection, you find your four samples, and for each sample you have two files, the forward and reverse reads, here in fastqsanger.gz format. You can go back one step in the collection or directly to the main history with these shortcuts. Now that we have a paired sequence collection, we're going to upload a reference genome to align it to. You can find the URL for the reference genome in the tutorial material; here it is the chromosome M reference, a fasta.gz file. So we copy the URL and, as we did previously, click Upload Data, Paste/Fetch data, and paste the URL. This time we specify that we're using the fasta.gz file format. Start, close, and we wait a couple of minutes for it to upload. Now that we have our paired read collection and our reference genome, we're going to use the tool BWA-MEM to align our reads to the reference. To find that tool in the tool panel, use the search field and type BWA, then select Map with BWA-MEM.
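To make the pairing rule concrete, here is a small Python sketch (not part of Galaxy, and the file names are illustrative) of the underscore-one/underscore-two matching that the pair builder applies: strip the read-direction suffix to recover the sample name, then keep only samples that have both a forward and a reverse file.

```python
import re

def pair_fastq_files(filenames):
    """Group fastq filenames into (forward, reverse) pairs based on the
    _1/_2 suffix convention, mirroring what Galaxy's pair builder detects."""
    candidates = {}
    for name in filenames:
        # e.g. "sampleA_1.fastq.gz" -> sample "sampleA", direction "1"
        m = re.match(r"(.+)_([12])\.fastq", name)
        if not m:
            continue  # not a recognisable paired-read file name
        sample, direction = m.group(1), m.group(2)
        key = "forward" if direction == "1" else "reverse"
        candidates.setdefault(sample, {})[key] = name
    # Keep only samples for which both reads are present (fully paired).
    return {sample: (d["forward"], d["reverse"])
            for sample, d in candidates.items() if len(d) == 2}
```

A file with only a forward read (no matching underscore-two partner) stays unpaired, just as it would remain in the "unpaired" column of Galaxy's pair builder.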
You can see the tool form here. The first parameter we're going to change: instead of using a built-in genome index, we're going to use a genome from the history and build the index on the fly. So we select the file we just uploaded, the chromosome M fasta file. We're not going to change the algorithm used by BWA. For the reads, we're going to specify that we're using a paired collection and select the M117-collection that we prepared earlier. We're not changing any other parameters for this tool, so we can click Run Tool. We wait a couple of minutes for the aligner to run, and then we'll look at the output. Now that our alignment is done running, we can take a look at the output collection. We notice that we have a collection with four datasets this time. The reason is that BWA-MEM takes as input one set of forward reads and one set of reverse reads, and outputs one alignment for these two inputs. Since we had four sets of paired reads, we get four alignments organized in one collection. Each of these four alignments is a binary BAM file. We're going to use these alignments to detect variation between the reference and our samples. To do that, we're going to use the tool Call variants with LoFreq: type "call variants" in the search bar and select Call variants with LoFreq. You can see that the tool doesn't find a dataset at first; that's because we need to specify that we're using a dataset collection, and then select the collection output by BWA-MEM. As previously, we're going to use a genome from the history and select the fasta file that we uploaded. We're going to call variants across the whole reference, and we're going to select both SNVs and indels. We're not going to change any other parameters, so we run LoFreq with this set of parameters.
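The "four pairs in, four alignments out" behaviour is just element-wise mapping: Galaxy applies the tool once per collection element and collects the results into a new collection with the same element names. A minimal sketch of that idea (the aligner here is a stand-in that only names its output; in Galaxy the real work is done by BWA-MEM):

```python
def map_over_paired_collection(paired_collection, aligner):
    """Apply an aligner element-wise over a paired collection, producing
    one output per sample -- the 'one pair in, one alignment out' logic
    Galaxy uses when a paired collection is fed to a tool like BWA-MEM."""
    return {sample: aligner(forward, reverse)
            for sample, (forward, reverse) in paired_collection.items()}

def fake_aligner(forward, reverse):
    """Stand-in for BWA-MEM: derive an output BAM name from the sample."""
    return forward.split("_")[0] + ".bam"
```

Because the output collection keeps the same sample names, downstream tools can keep tracking which result belongs to which sample without any manual bookkeeping.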
Click Run Tool, and we come back in a couple of minutes once the variant calling is done. The variant calling tool outputs files in VCF format; taking a look, we see some information on the quality of the run as well as a list of the detected variants, with the position and various annotations for each variant. What we want to do now is organize them into a tabular file, which is easier to manipulate in subsequent analyses. To do that, we're going to use SnpSift, specifically SnpSift Extract Fields from a VCF. As we did earlier, we specify that we want to use a dataset collection, and we select the collection produced by the variant calling step. Then we select the fields to extract, using the line specified in the training material: select the line, copy it, and paste it into the fields-to-extract box. We also select one effect per line. Once we have set all these parameters, we can run the tool to extract the variants from the VCFs into tabular files. Now that we've extracted them into tabular files, we can take a look and see that we indeed have one line per effect, plus one header line specifying what each column corresponds to. So we have one file per sample, but to run further analyses, most tools expect a single file with a different sample on each line. To do that, we're going to use a tool called Collapse Collection: go to the search bar, look for "collapse", and select Collapse Collection into single dataset in order of the collection. We select our dataset collection, which is the SnpSift Extract Fields output. We're going to keep one header line, to keep the information on what each column corresponds to, and we're going to prepend the file name so that each line records which sample the variant belongs to.
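What the collapse step does can be sketched in a few lines of Python; this is an illustrative stand-in for Galaxy's Collapse Collection tool with "keep one header" and "prepend file name" enabled (the sample names and columns are made up for the example, and the real tool's output formatting may differ in detail):

```python
def collapse_collection(collection, keep_one_header=True, prepend_name=True):
    """Merge a collection of tabular files into one table: keep a single
    header line and prepend the element (sample) name to every data line."""
    out_lines = []
    header_written = False
    for name, lines in collection.items():
        header, *data = lines
        if keep_one_header and not header_written:
            out_lines.append(header)   # first element's header only
            header_written = True
        for line in data:
            # Tag each data line with the sample it came from.
            out_lines.append(f"{name}\t{line}" if prepend_name else line)
    return "\n".join(out_lines)
```

The result is a single tabular dataset in which the first column tells you which sample each variant line belongs to, which is exactly what downstream multi-sample tools need.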
So we're going to prepend the file name on each line, on the same line as the rest of the data. Once that's done, we run the tool, and we can see that instead of a collection it outputs a single dataset. We wait a couple of seconds for it to run. Now that it's done, let's take a look at the resulting file: click on the eye icon, and we can see that we have all the variants from our four samples collapsed into one single file, and each line shows the sample the variant was identified in. In this short tutorial we've seen how to create a collection, how to use it in different tool forms, and how to manipulate a collection to collapse it into a single file. Many more operations are available: you can look at the collection operation tools to get an idea of all the different ways you can use and manipulate collections to fit your needs; they are described in more detail in the rest of the tutorial. Collections are a great tool to facilitate your work and declutter your histories. If you have any questions on how to use collections or how to optimize your work, we are available on the Galaxy Project training channel. I hope you have a great experience for the rest of your training event. Thank you, and have a good day.