In this video, I'm going to walk you through the tutorial "Mutation calling, viral genome reconstruction and lineage/clade assignment from SARS-CoV-2 sequencing data", which is part of the Galaxy training material. As always, you can follow this tutorial through the dedicated website, training.galaxyproject.org, or, and this is the way I'll show in this video, you can follow it from inside Galaxy by clicking on the little graduation cap icon here, which brings up the training material as well. So feel free to use whichever method works best for you. The tutorial this video is about lives in the section "Variant analysis", and in there you'll find the mutation calling and viral genome reconstruction hands-on. So we're going to follow this tutorial.

The questions this tutorial will help you answer are: how can you use Galaxy to extract variant calls from SARS-CoV-2 sequencing data, and specifically, which tools and workflows are available for that? And how can you then continue from variant calls to consensus sequence building and to lineage assignment with, for example, Pangolin or Nextclade? The Galaxy community has been very active and has come up with a set of workflows for almost any kind of input data to do just this kind of analysis. Combined, these workflows allow you to do sensitive identification of SARS-CoV-2 allelic variants and produce VCF files with those variant calls. They also let you generate more user-friendly reports for whole batches of results, assuming you don't have just one sample of SARS-CoV-2 sequencing data but many such isolates. And they enable you to do reliable and configurable consensus genome building from the called variants and to call viral lineages from those consensus sequences.

We really encourage you to bring your own data and use it to follow this tutorial, because the important point is that this is not some simplified demo: it uses the exact same production-ready workflows that the big public Galaxy servers are using for their large-scale genome surveillance efforts. So this is really the real thing we're showing here. Specifically, you can bring data that's either single-end or paired-end Illumina whole-genome sequencing data, paired-end Illumina-sequenced ampliconic data (for example, ARTIC-amplified data), or Oxford Nanopore Technologies (ONT)-sequenced ampliconic data. In this live demo, I'll use the example input data suggested in the tutorial, which is publicly available paired-end Illumina-sequenced ARTIC data from COG-UK, the United Kingdom's SARS-CoV-2 genome surveillance consortium.

As always, before we start any analysis in Galaxy, it's good to create a new history, to give the analysis we're going to perform its own workspace inside Galaxy, so to say. So this is what I do now in my account: you locate the little plus icon at the top of your history panel to create a new history, and then I give it a name, "SARS-CoV-2 variant analysis demo" in my case. I press Enter to have that name recorded, and now I have an empty new history. So back to the tutorial: the next step is, of course, to get some sequencing data to analyze. As I said, these are the supported input types, and I'm just going to use the data that's suggested as example data in this tutorial.
There are many ways to get your sequencing data into Galaxy, and you're really encouraged to read through this section on the different ways to import data. You can get data directly from the SRA, for example, through a dedicated tool mentioned in this tutorial. You can upload directly through Galaxy's upload manager in the web browser. You can upload via Galaxy's FTP server. You can import data from a Galaxy data library, if your admin has given you access to one with your sequencing data inside. Or you can always import via publicly accessible links, and this is what I'm going to do here. So this is the suggested example data for following this tutorial; as I said, it's paired-end Illumina-sequenced ampliconic data from the COG-UK project. For every sample there are two sets of reads, one forward and one reverse, so every record identifier here appears twice, once with _1 and once with _2. I'm just going to copy all these links. These specific example datasets are stored on Zenodo, as we usually do for Galaxy training material, but the important thing is that this would work for any sequencing data for which you have a publicly accessible link.

So we just copied that, and then I go to upload data, then "Paste/Fetch data", and put in all these links. For these specific datasets I know that they are in fastqsanger.gz format. Your mileage may vary, of course, if you brought your own data, but I'm going to tell Galaxy up front that all this data is of type fastqsanger.gz. This saves a little bit of time, because Galaxy doesn't have to auto-detect the format then. Then I just press start and initiate the upload of all this data. So it's ongoing; you can see that all these datasets are grey and waiting for upload.

In the meantime, we can go back to the tutorial and see what else we will need. Once the upload is done, we will create a paired collection out of this data, because it is paired-end sequencing data, but we're going to wait with that until the data is ready. In the meantime, we can already initiate the import of some auxiliary datasets that we're also going to need. One is the SARS-CoV-2 reference sequence, because that's needed to map the reads from our samples to the SARS-CoV-2 genome, the first step before variant calling. Then we need a tabular dataset defining aliases for viral gene product names, which we will later use for reporting the effects of our variants. And there are two datasets that are specifically needed for the analysis of ampliconic, for example ARTIC-amplified, data as we have it here. One of them is a BED file specifying the primers used during the amplification process and where they bind on the viral genome; this is important for trimming the primer sequences off the reads, which is done by the first workflow we're going to use. The other is an additional tabular file in a custom format that describes the amplicon grouping of the primers, so which primers are supposed to work together to form a given amplicon. If you're working with ARTIC primers, these files are already pre-prepared and publicly available for you. If not, there's an info box here that explains what you would need to do in case you have a different primer scheme. (As a side note, the kind of URL-based upload we just did can also be scripted against the Galaxy API; see the sketch below.)
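And here is what that scripted upload could look like. This is only a minimal sketch using the BioBlend library, assuming a recent BioBlend release; the server URL, the API key and the Zenodo links are placeholders you would replace with your own values.

```python
# Minimal sketch of scripting the "Paste/Fetch data" upload via the Galaxy API.
# Server, API key and links are placeholders; fastqsanger.gz matches what we set in the dialog.
from bioblend.galaxy import GalaxyInstance

SERVER = "https://usegalaxy.eu"        # placeholder: your Galaxy server
API_KEY = "your-api-key"               # placeholder: your personal API key
LINKS = [
    "https://zenodo.org/record/<record>/files/<sample>_1.fastq.gz",  # placeholder URLs
    "https://zenodo.org/record/<record>/files/<sample>_2.fastq.gz",
]

gi = GalaxyInstance(url=SERVER, key=API_KEY)
history = gi.histories.create_history(name="SARS-CoV-2 variant analysis demo")

for link in LINKS:
    # Declaring the datatype up front spares Galaxy the auto-detection step,
    # just like choosing fastqsanger.gz in the upload dialog.
    gi.tools.put_url(link, history_id=history["id"], file_type="fastqsanger.gz")
```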
The workflows themselves don't assume a specific primer scheme; they just take the information they find in these two files and analyze the data based on that. So we're now going to import all four of these auxiliary datasets via these links, from Zenodo again. There are also other places where you can find them, explained in this box, so again, feel encouraged to use whatever fits best with what you want to learn. For the sake of simplicity, I'm just going to copy and paste these links again: I go back to upload data, "Paste/Fetch data" again, and put in the links. This time I need Galaxy to auto-detect the formats, because I'm pasting four different files, each with its own format, so there isn't one common type I can tell Galaxy the uploads should become. I'm just going to start it and hope Galaxy gets the auto-detection right.

In the meantime, you can see that the upload of our sequencing data has just completed. So while we get the auxiliary data into Galaxy, we can already build the collection out of the 36 sequencing datasets we just imported. I click on this checkbox here, "Operations on multiple datasets", which reveals the multi-select boxes next to my datasets. I select all of them up here, but I don't want the four auxiliary datasets to become part of my collection, so I deselect those afterwards. Then I say, for all selected, "Build list of dataset pairs". That brings up this dialog, and Galaxy tries to be smart and pair the datasets for me according to the identifiers and extensions it finds. It suggests this pairing, which basically looks right to me: it pairs the proper sequence identifiers. However, in the middle column you can see what the dataset names will be, and ideally you would want just the sample identifier as your dataset name later on, but here you would end up with this .fastqsanger extension attached, which is not so nice. So we revert Galaxy's auto-detected pairing; "unpair all" moves everything back up into the top, unpaired section of the dialog. Now I type the real suffixes for these samples: it's not just _1 but _1.fastqsanger.gz on this side, and the reverse reads carry the suffix _2.fastqsanger.gz. Then I say auto-pair again, based on this suffix information, and this time Galaxy gets things right: I end up with only the run accession number, the ENA identifier in this case, without any suffix. So this is fine. I need to give the collection a name, so let's just call it "input sequencing data". I also want to hide the original elements so that my history looks a bit cleaner, so I make sure this checkbox is ticked, and then I click "Create collection". Quite quickly, Galaxy hides the first 36 datasets from the history view and builds this one collection. We don't need "Operations on multiple datasets" anymore, so I click that checkbox again and I'm back to my normal history view. In the meantime, the four auxiliary datasets have also finished uploading. The next thing is to check the datatypes that Galaxy auto-detected for them; remember, we couldn't specify them in the upload dialog, so we'd better check.
So this is the reference FASTA file, and Galaxy auto-detected the format fasta here, so that looks fine. We have the feature mapping TSV file, which is supposed to be a tabular file, and Galaxy recognized it as tabular. We have the primer BED file with the information about the primer binding sites on the viral genome, and here we have a problem: Galaxy detected this as an interval file, but the workflow we're going to use expects a BED file and will break if it isn't one. So it's important to change this datatype. How do you do that? You go to the little pencil icon on the dataset, where it says "Edit attributes" when you hover over it, go to "Datatypes", and now you can override what Galaxy auto-detected. Currently the datatype is interval; I set the new type to bed for this dataset and click "Change datatype". Galaxy will now think for a little while about whether that can really be true and check that this is a valid BED file. In the meantime, we check the last file, another tabular file; here Galaxy only found out that it's a text file but didn't recognize that it's tabular. So we change this to tabular, just like the other TSV file we uploaded: go to "Datatypes" again, the current type is txt, change it to tabular, and click "Change datatype". Once everything turns green, we're good to go. And yes, things worked: everything turned green and we've set the correct formats for all these auxiliary datasets. We can check our list of pairs again: we have 18 different samples, each carrying only its run accession number, its European Nucleotide Archive identifier, and inside each sample there's a pair of reads, one forward, one reverse, of format fastqsanger. So we also have a well-arranged paired sequencing data collection here, and we're all set to start our analysis.

However, I promised I would show you the real thing here, and this tutorial is a bit special in that it doesn't, for the most part, walk you through individual steps, that is, individual tools that you select from the tool panel on the left and run one by one, because this is something you would never do in production. In production, what you use is a workflow, or several workflows in this case. So most of this tutorial will actually consist of running workflows, and to be able to do so you need to get the specific workflows that the Galaxy community has developed; the tutorial shows you next how to do this. So we're all done with the "Prepare Galaxy and data" section: we have imported the sequencing data and the auxiliary datasets, and we can move on to "From FASTQ to annotated allelic variants", which is the job of the first workflow. Now, which variant calling workflow you use depends on which type of input data you're using to follow this tutorial. If, like me, you are using paired-end Illumina ampliconic data, either your own or the data suggested by this tutorial, then you need this workflow here, which uses BWA-MEM as the read aligner and LoFreq as the variant caller. You would need different workflows if your data is not ampliconic but whole-genome sequencing data, either single-end or paired-end Illumina, or if your data is ampliconic but prepared with ONT technology,
that is, Oxford Nanopore-sequenced data, because then, for example, the read mapper would be minimap2 and the variant caller would be Medaka instead of LoFreq. So depending on your type of input data, you will need a different workflow. Since I have this paired-end Illumina ARTIC-amplified data, I'm going to use the workflow made for that. Again, the tutorial shows you several ways to get these published workflows from the Galaxy community. You can get them from WorkflowHub, a site that collects workflows for bioinformatic data analysis of all sorts, Galaxy workflows being one part of that. You can also import workflows directly via GitHub links: the Galaxy community has an organization called the Intergalactic Workflow Commission (IWC) that reviews high-quality workflows and hosts them in dedicated repositories on GitHub, and here are the GitHub links for the different workflows. And finally, you can import these workflows from Dockstore, an option I skipped over before. Dockstore is, similar to WorkflowHub, a collection of bioinformatic pipelines and workflows from which you can download version-controlled workflows.

So which route are we going to take? Let's take the Dockstore route, because it works on many Galaxy instances. How to do this is explained in this tip box. We move back to Galaxy and go to "Workflow", which brings up all the workflows you currently have in your account; I have many, for you it might look rather empty. Then I go to "Import" to import a new workflow into my account, and there are different options. You could paste a GitHub URL here to retrieve the workflow via its URL. You could have a workflow file on disk, a so-called .ga file (a JSON format), which you could import here. And the third option, the one I'm going to use, is to import a workflow from a configured GA4GH TRS registry server like Dockstore. If I don't know the workflow's unique identifier, I can use Galaxy's search form to locate it there. So if I click on "search form", I'm taken to a page where I can select the TRS server I would like to use. usegalaxy.eu currently offers two different ones, WorkflowHub and Dockstore, the two I already mentioned; on usegalaxy.org, for example, you only have Dockstore for now. I'm going to use Dockstore because it's more widely available at the moment. Then you need to search, and the easiest way to locate any of the high-quality Galaxy workflows maintained by the Intergalactic Workflow Commission is to type "organization", followed by a colon, and then "iwc-workflows" in quotes, so that the dash in there gets parsed correctly. If I do that, I get a listing of everything available on Dockstore from the organization iwc-workflows, which is the Intergalactic Workflow Commission. So I have a couple of workflows here, and the one I want is the SARS-CoV-2 paired-end Illumina ARTIC variant calling workflow. I click on it, and I see there are different versions of this workflow. Usually, unless you have a good reason not to, you want the latest version, so I say: get me version 0.4.1 of this workflow. This puts it into my account in Galaxy, and the workflow will appear in a second in my list of workflows. So here it is: the COVID-19 variation analysis on ARTIC paired-end data, imported from an uploaded file, so it comes from Dockstore,
and it's an official workflow of the covid19.galaxyproject.org project. So what's next? Back to the tutorial to see how we run this workflow and whether we need to do anything special beforehand. We got the workflow; now we have to run it and populate its workflow run form, which I'm going to show you next. You have the run button here at the very right of the workflow entry, so I run this, which brings up the workflow run interface. Now I have several options. I could send the results of this workflow to a new history, but I don't want to do that; I want to keep everything in the one demo history I created. The workflow wants a paired collection of fastqsanger-encoded Illumina reads from ARTIC amplicons; that's good, because we have exactly such a collection, our "input sequencing data". It asks for a reference genome in FASTA format, and we have that from our auxiliary data. It wants an ARTIC primer BED file describing the primer positions, which we also have as a bed dataset. And it finally asks for the amplicon assignment tabular file, which we also have available in our history. So everything is configured correctly by default, but it doesn't have to be like this, so it's always good to verify the inputs to your workflow carefully: it takes a while to run this workflow, and if things break in the end because you started with the wrong input data, that can be quite annoying. There are additional workflow parameters here: a read removal maximum AF value, a read removal minimum AF value, and a minimum depth for amplicon bias correction. These are rather special settings of this variant calling workflow for ampliconic data, which I'm not going to discuss here, and we just leave them at their defaults. If you want to see what those are, you can expand this section and inspect the values, and you could also change them here. These are the default values, this all looks good, and so we can simply run this workflow.

When I do this, it brings up the workflow invocation interface, where I can follow the progress of the workflow. At the moment it's still invoking the workflow, so nothing has happened yet. Jobs will be added to the progress bar as they get scheduled by Galaxy and will turn green as they complete, and at the same time these datasets will appear in the history. Things are starting to come in, and at this point you need to be a bit patient. This workflow is going to run for a relatively long time, because it is a complicated thing that involves many, many steps to call these variants reliably from ampliconic data, so you can expect maybe one to three hours for all this data to be processed. This is a good time for a break; come back later to see how the workflow has proceeded. You can watch it scheduling all the steps, which seems to work nicely, but don't feel like you have to sit in front of the computer and watch all these datasets being processed.

Just to explain a little what's going on: because we have 18 samples, each step in the workflow, when it gets triggered, actually creates another collection of 18 datasets, one per sample, and these jobs run in parallel. So, for example, the read quality control and trimming of the sequenced reads with fastp is now occurring 18 times in parallel on the server. (By the way, if you run batches like this routinely, the whole launch can also be scripted against the Galaxy API; see the sketch below.)
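And here is roughly what such a scripted launch could look like with BioBlend. Again just a sketch: the workflow, history and dataset IDs are placeholders, and the input labels are assumptions you would need to match against the actual labels shown in the workflow run form.

```python
# Rough sketch of launching the imported variant calling workflow via the API.
# All IDs are placeholders; the input labels are assumed and must match the workflow's own labels.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.eu", key="your-api-key")

inputs = {
    # history items we prepared earlier: the paired collection plus the three auxiliary datasets
    "Paired Collection (fastqsanger)": {"src": "hdca", "id": "paired-collection-id"},
    "NC_045512.2 FASTA sequence of SARS-CoV-2": {"src": "hda", "id": "reference-fasta-id"},
    "ARTIC primer BED": {"src": "hda", "id": "primer-bed-id"},
    "ARTIC primers to amplicon assignments": {"src": "hda", "id": "amplicon-info-id"},
}

invocation = gi.workflows.invoke_workflow(
    workflow_id="imported-workflow-id",
    inputs=inputs,
    inputs_by="name",            # match inputs by label instead of by step index
    history_id="demo-history-id",
)
print(invocation["id"])          # keep this to poll the invocation state later
```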
For each of these collections you have a progress bar showing how many of the samples have already completed the step and how many are still running or queued. So fastp is fully running and about half of the samples have already completed it. The mapping with BWA-MEM is already running, but none of the 18 jobs has completed yet. The read filtering is queued and waiting for jobs here to complete; now one of the samples has finished mapping, and this will very soon trigger the first run of the read filtering for that sample. At any point you can look into one of these collections and see what's going on inside it: one of the samples has completed, more than half are running, and the others are still in the queue. This will go on and on until we reach the end of the workflow up here. It's now almost completely scheduled, and when everything has turned green, we are ready to continue.

So, it's a bit more than an hour, an hour and a quarter, later, and things have proceeded quite well: the workflow is fully scheduled and most jobs have completed. It's not yet entirely done, but close, and this already gives us a few things to inspect in the history. Maybe I can use the chance to point out what the workflow actually does, the different steps it takes, and take a look at some of the result files that are of interest. First off, we already discussed that the workflow starts with quality trimming of the sequenced reads. It then uses BWA-MEM as the aligner for mapping to the viral reference genome. What you get out of this — actually I need to refresh my history to also refresh the displayed collection contents, which is a somewhat annoying bug in the current version of Galaxy — so, back into that collection, it's all done now. What you get out of this mapping with BWA-MEM is a collection of as many BAM files as you have input samples, 18 in my case. Here are all the individual BAM datasets describing the mapping of each read in a given sample to the viral reference genome. Those mapped reads then get filtered for pairs in which both reads, forward and reverse, actually mapped to the viral genome. The mapped reads get realigned around indels to improve the accuracy of indel calling, and specifically for LoFreq we also add indel qualities with a dedicated LoFreq tool, so that the variant caller can do a better job calling those indels. Finally, we generate this dataset collection here, the fully processed reads for variant calling, by running ivar trim over the mapped reads. ivar trim is there to remove the amplicon primers that were used to amplify the starting material in the isolate, so this is the final BAM file that goes into variant calling.

If you followed, for example, the mapping tutorial before, that should have demonstrated how you can visualize such BAM files. Of course, Galaxy also lets you just look at a BAM file by clicking on the eye icon, and even though it's a binary file by nature, Galaxy can decompress it and turn it into a text view on the fly, so you can see what's inside. It describes the mapping of each individual read of this particular run to the viral reference genome, and gives quality scores, alignment descriptions, the sequence of the read, and so on.
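If you prefer to poke at such a BAM file programmatically instead of through the eye icon, you can also download it and open it with pysam. A small sketch, assuming pysam is installed and the dataset was saved as sample.bam (a placeholder name):

```python
# Sketch: peeking into a downloaded BAM dataset with pysam (filename is a placeholder).
import pysam

with pysam.AlignmentFile("sample.bam", "rb") as bam:
    for read in bam.head(5):              # just the first few alignment records
        print(
            read.query_name,              # read identifier
            read.reference_name,          # should be the SARS-CoV-2 reference
            read.reference_start,         # 0-based mapping position
            read.cigarstring,             # soft-clipped primer bases show up as S operations after ivar trim
            read.mapping_quality,
        )
```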
But you should have learned from the mapping tutorial that you can visualize such alignments with, for example, a tool called IGV, the Integrative Genomics Viewer from the Broad Institute. If you have that software installed locally on your machine, you can just open it, click on this "local" link here, and the currently selected BAM file will be opened for visual inspection in the IGV browser on your machine. It takes a little time to transfer the file and load it into IGV, but once that's done you can see what's inside. In my case I pre-prepared this to be a bit faster. For this particular sample, if we just look at the region around and inside the S gene, for example, this is the picture we see: these are all the aligned reads, displayed as pairs, and the colorful stuff at the two ends of each pair is actually the trimmed primers used for amplification. The tool, ivar trim, inspects the BED file with the primer positions, looks at the aligned reads, and can tell from the alignment that a read starts at a primer binding site with this sequence, so it had better clip it off the read so that it doesn't influence variant calling.

The issue is this: imagine you had a mutation at a particular primer binding site; let's take this one as an example. You could, of course, detect this mutation from the amplicon that overlaps the mutant site. But the reads from the amplicon whose primer binds right there would all carry the primer sequence at that position, and the primer is of course designed to match the reference sequence. In that case only about half of your reads would actually show the variant, even though it is present in all of the viral copies. So in order not to fool the variant caller, it's important to soft-clip the primer sequences from the reads, so that the allele frequency called for a variant at such a site comes out as one and not 0.5. That's the final read-processing step we take. And you can clearly see here that these are ampliconic reads amplified with the correct primer scheme, because ivar trim detects primer binding sites at basically all ends of all the reads we have.

Let's further inspect what we have in Galaxy. These fully processed reads then go into variant calling, which is the next step; LoFreq is used for this variant calling process, and it produces 18 VCF files describing the variants in each of your samples. At the same time, we use this BAM file, and also the outputs of the preceding steps, to calculate some quality measures, and this ends up at the top of the history here as the preprocessing and mapping reports. This is an overview report generated with the help of a tool called MultiQC, and it has a nice HTML display of general statistics across all the reads in all samples. This is quite convenient because right now we only have 18 samples, but in a real high-throughput situation you might have 96 or more samples in one of these analysis batches in one history, and then you do not want to investigate statistics on a per-sample basis and click through datasets in a Galaxy collection; you want a combined report, and that's what's produced as this preprocessing and mapping report.
From this you can learn what the median and mean coverage for each of these samples was. You see it's actually quite different between samples, but presumably good enough in all cases: the worst one I'm seeing here has a median coverage of around 2,800 reads per base, and still 95% of all sites in the genome are covered at more than 30-fold coverage. The mapping statistics are extremely good, but that's actually expected rather than suspicious here, because we used COG-UK data, and the COG-UK data that gets uploaded to the European Nucleotide Archive is pre-mapped: before submitting it to the ENA, they make sure they submit only mapped viral reads. So it's reassuring, but to be expected, that you find 100% mapped reads when you re-analyze this data. This would be different with your own data, which may well contain unmappable reads as well.

Good, then we have a few mapping statistics. This is again coverage, but maybe easier to understand: the cumulative genome coverage plot. Here you see again the vast discrepancies between the different samples. The sample with the worst coverage has a very rapid, roughly exponential drop-off towards its higher coverage values: while 95% of its genome positions are covered by at least a few hundred reads, coverage of more than 10,000-fold is only reached at about 10% of positions in this sample. If you compare this to the sample at the other extreme, there even a coverage of around 25,000 reads is still reached at more than 50% of positions. So there are vast differences even within the same preparation of sequencing reads from an established project like COG-UK. Don't be surprised by this, it's just how it is; as long as the overall coverage is good enough, it's not a real issue, but your pipeline has to handle these vast differences. The GC distribution across reads looks pretty much the same for all of the samples, not a big surprise since these are all SARS-CoV-2 samples, so the underlying GC content distribution of the genome should be the same for all of them.

Then you have some more mapping stats: how many reads there are in each sample and what percentage of them, shown in blue, got mapped to the viral genome. More detailed metrics too, with each sample highlighted across all kinds of different stats. The sample I'm hovering over here, the one with the highest number of total sequences, also has the most mapped reads, the most properly paired reads, and the most inward-pointing read pairs, meaning the forward and reverse reads are on opposite strands of the genome with their 3' ends pointing towards each other. So this all looks quite good. You can see the effect of the fastp trimming on the insert size distribution here; on the upper end, that's of course limited by the size of the amplicons generated by the ARTIC protocol, and the median insert size is around 380 bases. You can see the effect of read filtering for the forward and for the reverse reads; again, there's not much of an effect here because, as I said before, this data has already been preprocessed by the COG-UK project, so our filtering doesn't do much on top of it, but this will be different for your own data. The same goes for the N content:
there are hardly any ambiguous nucleotides in these sequenced reads, because they've already undergone filtering on the COG-UK side. But this is the kind of report you get out of MultiQC. And then, of course, after variant calling, things become interesting, because that's the information we really want: which of our samples has which variants in its genome. Essentially that information is already present in the first called-variants dataset down here; this is already one VCF file per sample, so 18 VCF files, and it contains most of the variants. We then actually go a long way to improve these variant calls further in this workflow, in steps that we run specifically for Illumina ampliconic data only. Briefly, what we are trying to achieve in the steps following the first round of variant calling is to compensate for potential amplicon bias introduced by mutations at primer binding sites, which could in turn affect the apparent allele frequencies of nearby, linked mutations. Essentially, we use the results of the first round of variant calling to identify mutations that could introduce such an amplicon bias, then we re-call variants without the potentially affected amplicons, compare the results to the original calls to see what changes, and then try to get the best out of both rounds of variant calling by intersecting the results. Long story short, this is computationally quite expensive and usually has a minimal effect on the called allele frequencies, but if you want to do it very accurately, like we do, and you have a lot of compute, then this is what we recommend, and it's what we do in our genome surveillance projects in production.

After that, those final VCF files still get annotated with a tool called SnpEff, which adds all the information about affected viral genes and so on, so the final file looks something like this. Here you can see one line per mutation in this particular sample. You have the reference allele and what the variant caller thinks is the alternate allele at that place. You have a quality score, that is, how sure the variant caller is that this mutation is really there. You have allele frequency values, the AF entries in the INFO column; let me just magnify that a bit. Some of these allele frequencies are very, very low; that is probably just noise, or a very low-frequency mutation that has arisen inside that particular patient, and we're going to filter these out later in our reporting and consensus-building workflows, but for now this VCF really contains every potential mutation the variant caller was able to pick up. Then DP is the coverage information, so how many reads we had at that site to make the call, and DP4 lists how many reads supported the variant allele and how many supported the reference allele on each of the two strands, forward and reverse. There's also a strand-bias value, indicating whether the variant is supported preferentially on one of the two strands. And then we have the annotations added by SnpEff in its own field of the INFO column. Here you see that this would be a non-synonymous coding change, a missense variant at the amino acid level, and that it affects, in this case, ORF1ab, causing an E-to-D change at position 93 of that protein.
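If you ever want to pull just those numbers out of such a VCF outside of Galaxy, the INFO column is simple enough to parse by hand. A minimal sketch; the filename is a placeholder, and the exact key under which SnpEff stores its annotations depends on the SnpEff version:

```python
# Sketch: extracting AF, DP and DP4 from the INFO column of a (gzipped) VCF.
import gzip

def parse_info(info_string):
    """Split an INFO string like 'DP=...;AF=...;SB=...;DP4=a,b,c,d' into a dict."""
    fields = {}
    for entry in info_string.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            fields[key] = value
        else:
            fields[entry] = True          # flag-style entries without a value
    return fields

with gzip.open("sample.annotated.vcf.gz", "rt") as vcf:    # placeholder filename
    for line in vcf:
        if line.startswith("#"):
            continue                                        # skip header lines
        chrom, pos, _id, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
        fields = parse_info(info)
        print(pos, ref, ">", alt,
              "AF =", fields.get("AF"),
              "DP =", fields.get("DP"),
              "DP4 =", fields.get("DP4"))
```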
So that's the information in such a VCF file, and the next workflow we're going to run right afterwards produces nicer, human-readable reports for a whole batch of samples out of these files, because, as you can imagine, it would be extremely tedious to step into each of these — only 18 here, but in production maybe 96 — VCF files and try to make sense of all these statistics somehow. That's not a nice job. So the purpose of the next workflow is to take all these VCFs and transform them into a per-batch variant file, listing all the variants observed in this particular batch and which samples were affected by them, in a nicely readable tabular format; we'll see this once it's finished.

We can actually already start setting up that workflow now, because the first workflow has been fully scheduled, and even a collection for which not all member datasets have been produced yet can already serve as input to the next workflow; the new workflow will simply wait until each sample is ready before processing it further. Right now you can see that all my samples have passed the first round of variant calling, most are already through the second round for the allele-frequency correction, and almost all have also been annotated. So let's jump back into the tutorial and look at what the next step is supposed to be. This is the step from annotated allelic variants (AVs) per sample to an allelic variant summary, and for this we need what we call the reporting workflow. That's yet another workflow maintained by the IWC, the Intergalactic Workflow Commission, which we highly recommend using instead of a homegrown solution. Again, you have the same options as before: you could take it from WorkflowHub, from Dockstore, or from the GitHub repository, and for consistency I'm just using Dockstore again, exactly as I did to get the first, variation workflow. So I go to "Workflow" again, which lists the workflows I already have in my account, then to "Import", and from there into the search form, select Dockstore, and type my query as before: organization, colon, and "iwc-workflows" in quotes. Now I again get the list of IWC-owned workflows on Dockstore, and I can see that one of them is the SARS-CoV-2 variation reporting workflow.

Importantly, at this point we can treat all the different types of input data the same. Both the reporting workflow and the consensus workflow are technology-agnostic: we had four different variant calling workflows, for Illumina paired-end and single-end whole-genome sequencing versus ampliconic data, and for Oxford Nanopore ampliconic data, but the VCF files produced by all of them are standardized enough that we can report on them with the same downstream workflow, and we can also build consensus sequences from them with the same downstream workflow. So there is only one variation reporting workflow and only one consensus-from-variation workflow. Now let's import this one reporting workflow. You can see it has two available versions; like before, we pick the latest one, version 0.1.1, and import it into Galaxy. It gets added to my account's workflow list, and here it is. This workflow actually comprises a surprising number of steps too, but those steps are computationally much simpler than the ones in the variation workflow, so it will complete much faster.
So don't worry about not finishing today: once you're done with the variation workflow, you're roughly 80% done in terms of computation time. Let's run this second workflow. Again, the run form asks us for a number of things, so let's start with the input data, the fields where I can select a file or collection. First, the variation data that we want to report on: this is one of the final SnpEff-annotated variant collections, and it's important that you pick the right one here. All our variant calling workflows produce two final annotated variant collections, one without and one with a strand-bias soft filter applied. In one of them, variants that exceed a certain strand-bias threshold get flagged as not trustworthy; in the other, this flag is not present. Which one you want to use depends on your type of input data. For Illumina whole-genome sequencing data, we do recommend using the strand-bias-filtered dataset, so that only variants that do not violate the strand-bias threshold get reported. For ampliconic data, be it Illumina- or Oxford Nanopore-sequenced, we do not recommend it: we still compute that file, out of interest and to investigate the differences further, but we really recommend working not with the suggested strand-bias-filtered file, but with the file before it, the final SnpEff-annotated variants collection. The reason is that for ampliconic data you may occasionally miss a very clear variant that fails the strand-bias filter; we're still investigating how to address this, but for the time being, just go with this file.

Okay, we have a couple of other things to set. First there is the gene product translation file; that's one of the auxiliary files we imported early on, the feature mapping TSV. It translates viral gene and transcript names as GenBank knows them into the names you expect as a virologist, like the S gene, the M gene and so on. So it will make the report we're going to generate more human-readable, and it contains the translation between those GenBank names and the ones commonly used in the field. That's our second input, and then we could, or could not, change the default settings for a few other things that determine which variants get reported as trustworthy by this reporting workflow. There's an allele frequency threshold, which defaults to 0.05: if at least 5% of the reads at a site support a variant, that variant gets reported. We can leave this at the default, because these workflows are really quite sensitive and quite reliable even down to this threshold. There's a depth filter, the minimum number of reads you have to have at a site to trust a variant called there; we can leave this at 1, because there is an alternative filter below, the DP_ALT filter. That's the number of variant-supporting reads required at a site to trust the variant call, and it is set to 10; and of course, if that is 10, the total depth at the site cannot be smaller than 10 either. (The sketch below just restates these three thresholds in code, if that helps make them concrete.)
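Just to make those three reporting thresholds concrete, the decision for a single variant call boils down to something like this sketch (not the actual workflow code):

```python
# Sketch of the reporting thresholds: allele frequency >= 0.05, total depth >= 1,
# and at least 10 reads supporting the variant allele (the DP_ALT filter).
def passes_reporting_thresholds(allele_frequency, depth, alt_depth,
                                min_af=0.05, min_dp=1, min_dp_alt=10):
    """Return True if a variant call counts as trustworthy for the report."""
    return (allele_frequency >= min_af
            and depth >= min_dp
            and alt_depth >= min_dp_alt)

# A variant supported by 12 of 300 reads (AF = 0.04) would not be reported,
# while one supported by 20 of 300 reads (AF ~ 0.067) would.
print(passes_reporting_thresholds(0.04, 300, 12))    # False
print(passes_reporting_thresholds(0.067, 300, 20))   # True
```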
Finally, for plotting the variants in the batch, which we'll look at later when it's ready, the workflow clusters samples that share similar variant patterns, and this last parameter is the number of sample clusters we're expecting. Of course we can't be sure about this up front, but with a sample size of 18 it's maybe reasonable to expect something like three different clusters, say lineages or sub-lineages, in there. If you're expecting more or fewer, because your data is either very diverse or very similar, say samples from a single hospital outbreak, then you could change this number, but three is probably a reasonable expectation here. And then we are ready to run this workflow. Like before, it's getting scheduled now; the progress bar for this particular workflow invocation will advance and turn green step by step. You can see that we're using the final SnpEff-annotated variants collection as input even though not all of that collection is ready yet: one sample hasn't finished variant calling, so SnpEff hasn't worked on it yet. Still, the new workflow gets scheduled now, and the steps that work on that collection can already run on the 17 samples that are ready inside it, while the jobs for the not-yet-ready sample simply stay paused until the data becomes available. And of course, because this reporting workflow combines the variant reports of all 18 samples into one at the end, it will have to wait until all individual samples have been fully processed, but that only happens at the very end of this workflow, so the whole scheduling can be done already and most jobs can run right now.

So, another half an hour later, all scheduled steps are done and the reporting workflow has run to completion, so let's take a look at what we got out of it. You can see that all variants have finally been called for all of the samples, and that's why the reporting workflow could create this combined variant report by variant, this combined variant report by sample, plus a variant frequency plot, which we'll investigate in just a second. Let's jump back to the tutorial and see what it has to say about these. The tutorial explains the three key result datasets produced by this workflow. The report by sample is essentially just the concatenation of the variant reports for each of the samples in the batch.
It starts with the first sample, listing all its variants, then goes on with the second sample, and so on, so you get 18 variant reports in tabular format, in this example, all concatenated together. Compared to the original VCF, the information is somewhat simplified: it lists the sample the current line holds information about, the position, whether this is considered a filtered variant based on the variant calling results (so whether you should trust it or not), then the reference allele, the alternate allele, the depth of coverage, the allele frequency, the strand-bias value, the DP4 value we already discussed, the annotations from SnpEff in a more readable format, and so on; and then, quite importantly, whether the same variant allele has been observed in other samples of the batch, and if so in how many, and what the minimum and maximum allele frequencies observed for it across the batch were.

Let's take a look at this in the actual history. The combined variant report by sample is this one. It lists a total of 867 lines, one of which is a header line, so that figure is one more than the total number of variant records in the file. Here we have a position, the sample affected by this variant, the depth of coverage, and an allele frequency of 0.8 in this case, meaning 80% of all reads of this sample at this position support the variant allele; in total, this variant has been seen in several samples, ours is the one with the maximum allele frequency for this batch, and the minimum allele frequency, in another sample, is 0.75; and so on and so forth. So this report iterates through all the samples we've analyzed, but now everything is combined in one file. Based on this report alone, you can answer some of the questions in the question box here. For all these questions, as you've seen in previous tutorials, you can expand the solutions, but you're encouraged to first try to answer them yourself: how many allelic variants are found in total across all samples (I already gave away that answer), how many are found for just the first sample in the document, how many are found for each of the samples, and how would you generate those statistics? For some of these you just need to take a look at the dataset itself, for others you need to run tools, but I leave it to you to uncover the solutions.
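As a hint for the counting questions: inside Galaxy you would use grouping and counting tools on this report, but if you download the by-sample report, the same numbers are a quick pandas exercise. A sketch only; the filename and the "Sample" column name are assumptions you should check against the actual header of the file:

```python
# Sketch: counting variant records per sample in the downloaded "by sample" report.
# Filename and the 'Sample' column name are assumptions; check the real file's header.
import pandas as pd

report = pd.read_csv("combined_variant_report_by_sample.tsv", sep="\t")

# total number of variant records in the batch (one row per variant per sample)
print(len(report))

# number of variant records per sample
print(report["Sample"].value_counts())
```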
The variant report by variant, in contrast, only looks at the unique variants called in any of the samples and pools the information for each variant across samples. So this report is shorter; it consists of only 184 lines, meaning that across this batch of 18 samples there were 184 distinct variants observed, some of them in several samples, and that's what this report summarizes. It starts with the position, the reference and the alternate allele: for example, this missense mutation here at position 5,388 of the genome, a C changed to an A, affects ORF1ab, and it has been observed — that's the "count unique samples" column here — in 15 different samples, where the sample with the minimum observed allele frequency showed this allele in only about every thousandth read at that site, while the sample with the highest observed allele frequency showed it at 0.9, so nine out of ten reads supporting the variant in that sample. Then it lists the samples that show this variant above threshold: only two of our 18 samples have this variant at above-threshold levels, and these thresholds refer to the filtering parameters we set when we ran the workflow. We configured a minimum allele frequency of 0.05 and a minimum of 10 variant-supporting reads in that workflow run, so in two samples we see this variant supported by at least 10 reads and at an allele frequency above 0.05. In the next column, all 15 samples get listed: it mentions 15 total samples, of which only two have this variant above thresholds, so there are 13 more, listed here, that show the variant but where either the detected allele frequency is too low or the number of supporting reads isn't sufficient to pass the threshold. Then come the allele frequencies in all the samples this variant has been observed in, and the change we're talking about, once more. So this is the information in the per-variant report.

Both of these reports have their justification, and you can derive different answers from each of them. The variant report by variant, for example, is pretty good for answering the questions listed here: how many distinct allelic variants are found in total, what kinds of impacts these variants have on the virus and how many there are for each impact class, how many different effects with a predicted high impact there are, and whether any variants are found in all 18 samples of the batch. Again you need a couple of Galaxy tools to answer these, and again I leave it up to you; the solutions can be shown by clicking on the plus icon here.

And then finally, one thing I personally really like a lot: the variant frequency plot. As long as you're dealing with moderately sized batches like this one, or maybe up to a hundred samples or so, this is a really nice way of visualizing what's going on in a batch of data. The variant frequency plot is an SVG file, so if you just click on it, it might not show because it's too large; you need to go to "show all", which opens the complete image and presents it in your browser — the browser is actually a good way to display SVG files — and we can hide the tool panel to see a bit more. What this plot does is show you all sites at which a variant was found in at least one of the samples of the batch. Every row in this plot represents a particular sample, so there are 18 rows, and every column is one of those variant sites found in at least one of the 18 samples. You can see at a glance that quite a lot of mutations are shared between the samples. The color legend up here indicates the gene affected by the variant: all the first variants here in the yellow part affect ORF1ab, the precursor of the non-structural proteins of the virus, then you have the S gene here, and then the others, with a legend at the far right of the plot. The effect of each variant is also color-coded, in the second row, and the color of the individual cells, the variants found in one particular sample, indicates the allele frequency at which that variant was observed: the darker the color, the higher the observed allele frequency of this particular variant in this particular sample. Most of them are called at really high allele frequency.
A few questionable ones are only found in particular samples at very low frequency. And then, as I said, this plot also clusters your samples by the similarity of their variant patterns, so you can quickly see whether all your samples are likely from the same lineage or whether several different lineages occur here. This is data from earlier this year, when we didn't yet have much of the Delta variant present in the UK but mostly the Alpha lineage, and the whole upper group here is actually Alpha. If we zoom into this part of the S gene, you can see one of the mutations here — it's a bit hard to read the legend until you go up to really high magnification — the mutation at position 23,063, which at the amino acid level is the N501Y mutation, one of the hallmark mutations of the Alpha lineage and not present in the Delta lineage. So clearly that first, top-most large cluster of samples represents the Alpha lineage isolates in this batch. The others are less clear: maybe one of them is already a Delta lineage sample and the others might be more like the original Wuhan virus, but that's hard to say without zooming into all these variants. Still, this simple clustering by similarity already gives you a nice overview of how much diversity there is inside your batch, and you would also have a chance to spot library preparation issues: if a sample shows many low-allele-frequency, pinkish, light-colored variants, then it is probably not of very good quality overall. Here everything seems rather fine, everything went smoothly, so this batch is pretty much okay. And those are the three major things you can get out of the variant reporting.

Now, what most people are really interested in, of course, is the consensus building, and for that we need the third and final workflow in the chain of workflows we typically use in the COVID-19 Galaxy project. We're going to import this one from Dockstore as well, and you've seen the procedure repeatedly already: I go to "Import" again, into the search form, one final time change this to Dockstore, and type organization, colon, and, in quotes, my iwc-workflows search term. This time I'm interested in the single consensus-from-variation workflow, which again works for all the different input types, be it ARTIC-amplified or not, ONT or Illumina. We take the latest version, 0.2.1, of this workflow, import it into our account's workflow list, and we are ready to run it. This particular workflow asks for variant calls again: this is once more the SnpEff-annotated VCF collection that we also used as input for the reporting workflow, so we scroll down and find it again, the final SnpEff-annotated variants, and I select it as input. It also needs the aligned reads for depth calculation: that's the fully processed reads, with primers already trimmed by ivar trim, the collection that went into the first round of variant calling, and that's the dataset we want here. Then it wants a reference sequence, and that's it. You can normally leave the remaining settings alone; these are the recommended defaults for this workflow and, assuming you ran the variation workflow with default settings before, they should fit. You have a minimum allele frequency for calling something a consensus variant, 0.7, which makes sure we catch most, or almost
all, of the consensus variants present in typical samples — I'll explain this minimum allele frequency while the workflow is running — and also a depth threshold for masking. So we just run this now with these settings, using the two outputs of the variation workflow, the fully processed reads and the final SnpEff-annotated variants, and launch it. You've seen the scheduling of a workflow before, so datasets will appear in our history soon.

In the meantime: what does this consensus workflow do that makes it a bit special compared to many of the simpler consensus workflows out there that people are using? This particular consensus-building workflow tries to find a good compromise between capturing as many potential variants as possible in the consensus sequence and highlighting positions that are questionable, where we can't be sure whether the reference allele or a variant allele is really there. For this it uses the two thresholds we've just seen. If a variant is above the consensus threshold, which we configured as 0.7, it always makes it into the consensus sequence we're building. If it's below that, but above the lower threshold, which is set to 0.25, then the variant is considered questionable: its allele frequency is neither low enough to be negligible, so it seems to be real, but it's also not at 0.7 or above, so it doesn't justify a consensus variant call, because there's also a good proportion of reference reads. In that case we incorporate an N for ambiguity into the consensus sequence. We do the same if the coverage at a position is below our depth threshold of 5, which we also didn't change from the default: if there's a site in the genome with fewer than 5 reads of evidence for any of the possible bases, we put an N into the sequence, because we cannot be sure. However, if that site also has a consensus variant above the AF threshold of 0.7, we never mask it: even if we have only 5 reads of coverage or fewer, but the variant caller made a good call for a non-reference allele, we keep it, because we think it's more important to retain the information that there is a potential variant there than to flag the low coverage. If, on the other hand, it's a reference allele at bad coverage, we say: we cannot show whether this really is a reference position, maybe we just could not detect a variant sitting there because coverage was too low. These are the rules applied here, and that's why this workflow also consists of quite a number of steps; they all run really fast, though, so it doesn't take long to build the consensus sequence, even if it's quite an involved procedure.

What you get out at the end are two things. One is a FASTA consensus sequence for each sample, as a collection of datasets, so with 18 input samples we get 18 FASTA sequences out, and this is where they will eventually end up. And the final thing is that the workflow bundles everything into a multi-sample consensus FASTA, all the FASTA sequences of all your samples combined into a single file, because that is very convenient input for tools like Pangolin and Nextclade, which we'll discuss next.
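The per-site masking logic just described can be summarized in a few lines of Python. This is only a sketch of the rules with the thresholds we configured (0.7, 0.25 and a depth of 5), not the actual implementation, which, as we'll see, builds the sequence with bcftools consensus and a masking file:

```python
# Sketch of the per-site consensus rules (thresholds as set in the workflow run form).
CONSENSUS_AF = 0.7   # at or above this allele frequency -> consensus variant
MIN_AF = 0.25        # between MIN_AF and CONSENSUS_AF -> questionable site
MIN_DEPTH = 5        # below this coverage a reference call cannot be trusted

def consensus_base(ref_base, alt_base, allele_frequency, depth):
    """Decide what ends up in the consensus sequence at one genome position."""
    if allele_frequency is not None and allele_frequency >= CONSENSUS_AF:
        # A confident consensus variant is always kept, even at low coverage.
        return alt_base
    if depth < MIN_DEPTH:
        # Too little evidence to claim the reference base is really there.
        return "N"
    if allele_frequency is not None and allele_frequency >= MIN_AF:
        # Neither clearly reference nor clearly variant -> mask as ambiguous.
        return "N"
    return ref_base

# Examples: a variant at AF 0.9 stays, one at AF 0.4 becomes an N,
# and a reference position covered by only 3 reads also becomes an N.
print(consensus_base("C", "T", 0.9, 3))     # 'T'
print(consensus_base("C", "T", 0.4, 100))   # 'N'
print(consensus_base("C", None, None, 3))   # 'N'
```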
What you're getting out in the end is two different things. One is a FASTA sequence for each sample, as a collection of datasets: if we have 18 input samples, we're getting 18 FASTA sequences out, and this is where they will eventually end up. The other is that this workflow bundles everything into a multi-sample consensus FASTA, so all the FASTA files from all your samples combined into a single file, because this is very convenient input for tools like Pangolin and Nextclade that we'll discuss next.

Here you can see how the workflow operates. It locates regions of low coverage from the fully processed read files, it finds the filter-failed sites, that is, variant sites that fall between the lower and the upper threshold, so in our case between 0.25 and 0.7, and then it combines these two sets of sites, the ones with low coverage and the filter-failed sites, into one file. From this combined file it subtracts the consensus variant positions, and the remaining positions are the ones we mask. The consensus sequence itself is built with a tool called bcftools consensus, which just incorporates all the consensus variants we called into the reference genome that we pass in.

So the consensus workflow has run to completion, and we obtained this collection of 18 individual FASTA files plus the concatenated file of all the FASTA sequences: each sample's sequence preceded by a header line, with the Ns and all the consensus variant alleles incorporated into it, and this for all 18 samples we have in the batch; here's the start of another one. As I said, these multi-sample consensus FASTA files are the perfect input for tools like Pangolin and Nextclade, which the tutorial proposes to run next.

So this is from consensus sequences to clade and lineage assignment, then. One thing you might be wondering immediately is why these Pangolin and Nextclade lineage assignments are not part of the workflow. Well, there are different reasons. First of all, not all people want the same thing; there's less agreement here on whether you would want to use Pangolin and/or Nextclade. Second, it's really just a one-step procedure to go from the multi-sample consensus file to the lineage assignment, so you can easily do this by hand. Another important thing is that Pangolin, for example, uses a machine learning model to call these lineages that is updated regularly, so even already-assigned lineages from older samples might change if you run Pangolin over them again. The time when you analyze your samples might not be the time when you want to do the lineage assignment; maybe, if you're adding these samples to a dashboard, you would want to run Pangolin only as the last step before putting the samples there, to have the most recent lineage assignment at the time of data deposition. So there are different usage scenarios, and that's why we figured it's best to keep things modular, to leave Pangolin and Nextclade out of these Intergalactic Workflow Commission (IWC)-maintained workflows, and to let users run these tools whenever they see fit.

It doesn't take much, as I said, to run Pangolin, and the tutorial explains how to do this: you just call the Pangolin tool with one simple parameter if you want to use the default settings, which is mostly fine. You use the multi-sample consensus FASTA that we just generated in the last workflow as input, and if you do this, Pangolin will download the latest Pangolin model from the web at runtime. So it's important that you run it on the combined multi-sample consensus FASTA, or you will trigger as many downloads as you have samples in your batch, and at some point GitHub might just not allow any more downloads and then you're blocked. So it's good practice to do this on the multi-sample consensus FASTA, and you will get the very latest information from the web that's current right now. On usegalaxy.eu, at least, we also have some snapshots of earlier Pangolin models that you could use if you want reproducibility, but usually, as I said, you want the latest information available.
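By the way, the workflow already hands you this combined file, but if you ever only had the per-sample consensus FASTA collection, building such a multi-sample FASTA yourself is just a concatenation. Here's a minimal Python sketch; the folder and file names are placeholders, not anything the workflow produces under these names:

    from pathlib import Path

    # Hypothetical per-sample consensus FASTA files, one sequence each;
    # adjust the folder and pattern to wherever your files actually live.
    sample_fastas = sorted(Path("consensus_fastas").glob("*.fasta"))

    # Concatenate them into one multi-sample FASTA, the convenient input
    # for Pangolin and Nextclade.
    with open("multisample_consensus.fasta", "w") as out:
        for fasta in sample_fastas:
            out.write(fasta.read_text().rstrip("\n") + "\n")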
So that's all there is to it: you just execute the tool, and Pangolin is also pretty fast, so there's not much time between starting it and obtaining the results. The tutorial then describes the same thing for Nextclade, which is also super easy: you just provide the multi-sample consensus FASTA again and tell it that you want, for example, a tabular report as the output, and that's that. I won't go into detail on both outputs in this video, so I'm just going to show you Pangolin, and again I leave the question from the tutorial for you to answer; by now you should know a good Galaxy tool for answering it, and otherwise you can peek into the solution if needed.

Let's see. Pangolin emits a couple of interesting fields. The taxon field, the first column of the results file, is really just your sample identifier. The lineage is the thing most people are interested in: whether your sample belongs to any of the well-known SARS-CoV-2 lineages that are currently known. It confirms what we already saw in the variant allele frequency overview plot: most of these samples are classified as B.1.1.7, which of course is the WHO-designated Alpha lineage. The Scorpio call (Scorpio is a piece of software that comes bundled with Pangolin) provides the simpler classification scheme of the World Health Organization, while Pangolin uses its own internal lineage names in the output; so Delta is B.1.617.2, of course, and Alpha is B.1.1.7. You see we had almost exclusively Alpha in this batch, then actually three samples of the Delta variant, and just one other type of lineage, which is actually Eta, so also one of the more recent lineages. The ambiguity score is better if it's higher: if it's 1, there is no ambiguity at all and Pangolin is sure about the call it made. The same is true for Scorpio with the Scorpio support: if it's 1, Scorpio is sure about the call it made, so higher scores are better in each of those cases. There is also a Pangolin conflict value, which is better if it's lower, and the same holds true for the Scorpio conflict value. Scorpio also lists how many of the lineage-defining alternate alleles it could find for the specific lineage, how many positions unexpectedly showed the reference allele instead, and so on. So this is the simple output that Pangolin gives you.

And with that, I think we are at the end of this tutorial. You saw that it's actually really easy to combine this set of workflows: we started out with one of several types of input data, we picked the right variation workflow for it to call the variants, which results in a VCF file at the end, and then we used the reporting workflow (there's only one of those) to turn this VCF into a TSV/tabular representation and to create the plot. We also ran (and that's not shown here) the consensus workflow and obtained FASTA sequences that we could easily feed into Pangolin and Nextclade. So this small set of workflows, which can be used rather straightforwardly on all the big Galaxy instances, or on your own instance if you install them, can really handle this whole task of genome surveillance, from sequenced reads of different viral isolates to the final lineage assignment done with Pangolin or Nextclade.
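One last tip: if you download the Pangolin report from Galaxy, it's easy to summarize it further with a few lines of Python. This is just an optional sketch; the file name and the exact column names (taxon, lineage, ambiguity_score, conflict) are assumptions based on the Pangolin versions current at the time and may differ in your report:

    import pandas as pd

    # Read the tabular lineage report produced by Pangolin; check your own
    # report's header, as column names can change between Pangolin releases.
    report = pd.read_csv("lineage_report.csv")

    # How many samples fall into each Pango lineage?
    print(report["lineage"].value_counts())

    # Samples where Pangolin was less certain (ambiguity score below 1).
    uncertain = report[report["ambiguity_score"] < 1]
    print(uncertain[["taxon", "lineage", "ambiguity_score", "conflict"]])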