Hi everybody, and welcome to this brief demo of the tutorial "Mutation calling, viral genome reconstruction and lineage/clade assignment from SARS-CoV-2 sequencing data". My name is Wolfgang Maier, I'm working for the European Galaxy team, and I'm going to walk you through the most difficult parts of this tutorial.

To take a brief look at the outline of what we're going to do: we're going to run a sequence of production-ready workflows that can be used to perform a full SARS-CoV-2 sequencing data analysis, from raw sequenced reads up to quality control reports and consensus sequences of the viral genomes of your different samples. The analysis works with complete batches of samples, like they would come out of real-world genome surveillance projects. We're going to look at reports and visualizations both at the sample and at the batch level, and we're going to do lineage assignment using Nextclade and Pangolin.

The workflows that we will be running, and that are really at the core of this analysis, are the following. First, one of four workflows for going from raw sequenced reads to mutations called per sample; because we're going to follow along with the suggested example data of this tutorial, the workflow we're going to be using is the one for tiled-amplicon Illumina paired-end sequencing data. Its main output is a collection of VCF datasets that holds those mutations for each sample in the batch. Next, we're going to run a variation analysis reporting workflow that produces reports and visualizations for us, and we're going to have a brief look at what's inside these reports and plots. Then we're going to run a third and last workflow, the COVID-19 consensus construction workflow. It will take those mutations per sample produced by the first workflow and incorporate them into the viral reference sequence, so that we obtain consensus genomes for all of our samples. And with these consensus sequences we're going to do the lineage assignment with Nextclade and Pangolin, which means running a few tools manually inside Galaxy.

Okay, then let's start, because this is going to be brief and will just explain the more complicated steps of the tutorial.
I already set up my analysis history: I created a new history, named it, and followed the instructions in the tutorial to upload all the necessary input datasets. So let's just confirm that everything went all right. I'm going with the described example data of this tutorial, so we're going to analyze a batch of very early, in fact the first, Omicron data that came out of South Africa at the end of 2021; we picked a subset of that batch, and that's 16 different samples. What I have here is a collection, a list of 16 pairs of fastqsanger.gz datasets, that is raw sequencing data that was paired-end sequenced on the Illumina platform. If I step inside it, I see those 16 samples with their NCBI Short Read Archive identifiers as the element names, and each of these elements is actually a pair of datasets, as it says here. If I click on one of them, I should see that there is indeed a forward and a reverse dataset contained inside, and each of these datasets is of the format fastqsanger.gz, so gzipped fastqsanger, which Galaxy still allows you to preview here as plain text. Going back out of this list of pairs, we have confirmed that this input is what it should be.

Then I've downloaded the SARS-CoV-2 reference sequence just as described in the tutorial, and I ran the Replace Text tool on it, as described in the tutorial, to change the name of the sequence identifier to NC_045512.2. According to the instructions in the tutorial, I renamed this dataset to "SARS-CoV-2 reference", because that is the input expected by the workflows we're going to run next. So this is okay too; also note that the format here is fasta.

Then I've uploaded the SARS-CoV-2 feature mapping according to the instructions. That's a tabular file with the NCBI RefSeq identifiers of all the open reading frames and peptides in the SARS-CoV-2 genome in the first column, and the more readable common names of those features in the second. We can also go to a full view of that dataset in the middle panel, and you see that this looks pretty much as the tutorial shows it. Importantly, the format is tabular, so that's correct.

I've also uploaded the primer information in BED format, and it's important that this dataset has the format bed. If you look inside, this dataset describes the different primers used in the ARTIC v4 primer set for the tiled-amplicon amplification. It lists one primer per line: where it binds with respect to the SARS-CoV-2 reference sequence, the name of the primer, which amplicon pool that primer belongs to in the fifth column, and in the sixth column the strand of the reference genome that this primer is supposed to bind to. Don't worry about the names of the sequences here; this is actually the INSDC identifier of the SARS-CoV-2 reference sequence, and the tutorial talks about the equivalence of this identifier and the NCBI RefSeq identifier. For the primer BED file it really doesn't matter which one is listed; the tools that we're going to use with this primer BED file will not care about this name, as long as there is just one single name in the file. Good, so the datatype of this dataset is bed, and that's fine.

Then we have a second file describing the primer scheme, not to be confused with the first one. This dataset is the amplicon info file, and its datatype should be set to tabular. Looking inside this dataset, we see that it is a tabular dataset with one line per amplicon produced by the primer scheme, and it lists the primers that are responsible for the formation of a given amplicon, one amplicon per line. So we have primer number one left and primer number one right on one line, because together they form the first amplicon of the ARTIC v4 primer scheme, and then we have one line per amplicon and continue like this. Normally each line has two primers; for some primer schemes that use nested primers to produce a single amplicon with higher specificity or efficiency, there may be more than two primers on a single line.
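To make the relationship between these two files concrete, here is a minimal sketch of how an amplicon info table could be derived from a primer BED, assuming the primer names encode the amplicon number the way ARTIC-style schemes typically do (for example SARS-CoV-2_1_LEFT, SARS-CoV-2_1_RIGHT, plus possible _alt primers); the file names here are made up for illustration:

```python
# Sketch: group primers from a primer BED into one line per amplicon,
# assuming the amplicon number is embedded in the primer name.
from collections import defaultdict
import csv
import re

amplicons = defaultdict(list)

with open("ARTIC_nCoV-2019_v4.bed") as bed:              # hypothetical file name
    for row in csv.reader(bed, delimiter="\t"):
        # BED columns: chrom, start, end, name, pool, strand
        name = row[3]
        match = re.search(r"_(\d+)_(LEFT|RIGHT)", name)
        if match:
            amplicons[int(match.group(1))].append(name)

with open("amplicon_info.tsv", "w") as out:              # hypothetical file name
    for number in sorted(amplicons):
        # one line per amplicon, listing all primers that form it
        out.write("\t".join(amplicons[number]) + "\n")
```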
That doesn't have to be the case every time; in your situation, if you set up your history slightly differently, there might be mix-ups, so please check these inputs carefully. Then we can leave the rest of the parameters at their defaults and basically just run this workflow. If you click the little gear wheel icon here, you would have the option to send the results to a new history; if you prefer to keep this prepared history pristine and have all the workflow results sent to a new history, you could check that little box. But for the purpose of this tutorial it's actually preferable, I would say, to have everything go to one history, so I just press "Run workflow" and off we go. The scheduling gets prepared, Galaxy waits for this invocation of the workflow to complete, you will have progress bars visible here, and once everything is scheduled we can look into the different results.

Okay, all 33 steps of this workflow have now been scheduled successfully. They are not finished, but the history will not acquire any new datasets from this workflow run anymore, so it's in good shape and we're just waiting for all the remaining jobs to run and finish. At this point we are good to set up the next workflow run. Going back to the overview, that will be the COVID-19 variation analysis reporting workflow, which we're going to run next. It will run on the collection of VCF datasets with the mutations of each sample in our batch and create those plots and reports.

Like the workflow before, we're going to obtain this one from WorkflowHub; it's the exact same procedure, so we're repeating what I've shown you before: go to "Workflow", import, import from a GA4GH server, choose WorkflowHub or Dockstore, whatever you prefer. We type our search query again, so that was organization:iwc and then the name filter "sars-cov-2", and we close the quotes here. We're back, and the top hit is actually the COVID-19 variation reporting workflow we want, so we expand it, see the different releases available for this workflow, pick the latest, and get it imported into our list of workflows, ready to run, and that's what we're going to do.

We bring up the workflow run interface, and this workflow asks for two input files: the variation data to report, up here, and the gene products translation file, down here. For the variation data to report you need to be careful to choose the right one. The one it suggests here is the final SnpEff-annotated variants with the strand-bias soft filter applied. If you read the tutorial carefully, it will warn you, though, that for tiled-amplicon data this is an experimental output, because the strand-bias filter used by the variant caller is not totally adequate for tiled-amplicon data. So in this case, since we're following along with tiled-amplicon data, no matter which primer scheme, we're not going to use this file; we're using the parent of this file, the simple final SnpEff-annotated variants file, and ignore the other one, which is just experimental. You can inspect it if you want to learn what the difficulties are, but go with the simpler file here.

We can leave all these filters at their defaults, but here we have to take action: the gene products translation file should be the feature mapping file we uploaded initially, and Galaxy chooses another tabular file here, the amplicon info. So we have to correct this; we want the SARS-CoV-2 feature mapping file. This file will be used by the workflow to produce more readable names for the different genomic features of SARS-CoV-2 in those reports.
So it's this file down here, the two-column mapping between the hard-to-read NCBI RefSeq identifiers and the more commonly used names of the features. Then we're setting the number of clusters to three. This will affect the variant frequency plot, the major plot produced by this workflow, and it simply tells the workflow how many clusters to consider main clusters in the batch: when it does hierarchical clustering of the samples in the batch based on their mutation profiles, how many of those clusters to treat as main clusters and set apart visually a little bit. The choice here is somewhat arbitrary, and we just go with three because for the suggested batch we established this to be a good number. Okay, then we're ready to run this workflow, and it will just add datasets on top of the ones produced by the first workflow. We're back at the progress monitoring page, where we see that Galaxy is now preparing the invocation of this new workflow; it will add datasets and summarize the states of the different jobs.

Okay, at this point the first workflow, the variation analysis workflow, has finished. All its outputs are ready, and we can explore a few of its key outputs now, even while the scheduling of the reporting workflow is still ongoing. If you remember, we started out with the raw reads, and the steps this workflow performs are the following. It first does quality assessment of these Illumina-sequenced paired-end reads with the tool fastp, producing quality metrics for the raw reads and letting us spot problems with the sequencing itself early on; it also performs trimming, or even filtering of complete reads if their quality is really poor, so that we only feed good-quality, trustworthy reads into the further analysis. We then map these reads to the SARS-CoV-2 reference with the tool BWA-MEM, and filter those mapped reads again, based on their mapping quality and on whether the two reads from a pair could both be mapped to the reference or not. We realign those reads with the tool lofreq viterbi so that we produce especially high-quality mappings around indel positions, which will improve the quality of indel calls later on. We produce additional statistics on the mapped reads with Samtools stats, then add indel qualities with LoFreq again, which is a prerequisite for indel calling with that variant caller. Then we do primer trimming, and this is where the amplicon scheme and the primer scheme come into play: we use ivar trim to remove primer sequences from the mapped sequenced reads before doing variant calling with LoFreq, and we use a tool called QualiMap to report all kinds of quality metrics and produce nice plots on the fully mapped and processed, final, mapping result, so to say. The called variants are then filtered and annotated with a tool called SnpEff, which adds information to the variant calls about which genomic features of SARS-CoV-2 are affected: whether a given variant is, for example, a silent variant, only taking effect at the nucleotide level but not at the amino acid level, or, if it does affect an amino acid, what that change would look like at the protein level. The workflow also does the questionable strand-bias soft filtering, which we said is experimental and whose output file we ignored. And finally, from all the different mapping and read quality metrics produced along the way, we generate the final preprocessing and mapping report.
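The workflow performs the mapped-read filtering step mentioned above with dedicated Galaxy tools, but as a rough illustration of what it means to keep only confidently mapped, properly paired reads, here is a minimal sketch; the mapping-quality cutoff of 20 and the file names are assumptions for illustration, not the workflow's actual settings:

```python
# Conceptual sketch of mapped-read filtering: keep only properly paired
# reads above a mapping-quality cutoff.
import pysam

with pysam.AlignmentFile("mapped.bam", "rb") as bam_in, \
     pysam.AlignmentFile("filtered.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in:
        if read.is_unmapped or not read.is_proper_pair:
            continue                      # drop unmapped or improperly paired reads
        if read.mapping_quality < 20:
            continue                      # drop low-confidence mappings (assumed cutoff)
        bam_out.write(read)
```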
That final report we can look at right now. We just click the display icon here, and we're dropped into this MultiQC report on all the different quality statistics that we obtained throughout this workflow. It is really advisable that for each batch of data you at least take a brief look at this report to see if there was any trouble with any of the samples in the batch.

Here we're seeing the coverage of the mapped reads on the SARS-CoV-2 reference, and of our 16 samples most are shown in green and look quite okay. For example, if we look at what percentage of positions in the SARS-CoV-2 reference is covered at least 30-fold, so with 30 or more reads, in each of our samples, we see that 99 percent or more, or at least well above 90 percent, of all sites in the reference genome are covered more than 30-fold for most of the samples. We're doing a bit worse for the 502 sample, where that value is just below 80 percent. But there is real trouble with at least three samples according to this metric: one that is really poor, the 5405 sample, which has only five percent of the sites of the reference covered with more than 30 reads; the second worst, the 4508 sample, with 28 percent; and the 4506 sample, which is also not particularly great. We also see the median coverage, and while for most samples the median coverage over all sites in the reference is several hundred-fold, it is just 188-fold for this sample here, and the median is actually zero for this one, so this sample is probably of too poor quality to do anything useful with; it's 52-fold for this one, and also zero for that other 408 sample. So there are some samples in this batch for which we will really have trouble identifying mutations, because if you don't have sequencing information, how would you identify mutations?

We can scroll further down and see this illustrated in various other plots. A particularly instructive one is the cumulative genome coverage plot, which shows the fraction of the reference that is covered by a certain number of reads in each of the samples. You see that some really good samples have around 90 percent of the reference sequence covered by more than 1000 reads, which is really excellent coverage, but you also see that the problematic samples, like the 505 sample, show a very sharp drop in these coverage statistics: even at 50-fold coverage, only five percent of the reference positions are covered in that sample, and for the 508 sample it's below 30 percent of positions covered at 50-fold coverage. So yes, there are problematic samples in there, and if you just go by these lines you would say these four samples are the problematic ones: 502, 506, 508 and 505, in order of increasing severity.
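If you ever want to recompute the two metrics highlighted here, breadth of coverage at 30-fold or more and median depth, outside of MultiQC, a minimal sketch could look like this, assuming a per-position depth table such as the one `samtools depth -a sample.bam` produces (three columns: chromosome, position, depth); the file name is made up:

```python
# Sketch: breadth of coverage at >=30x and median depth from a
# per-position depth table.
import statistics

depths = []
with open("sample_depth.tsv") as handle:        # hypothetical file name
    for line in handle:
        depths.append(int(line.split("\t")[2]))

breadth_30x = sum(d >= 30 for d in depths) / len(depths)
print(f"fraction of positions covered >=30x: {breadth_30x:.1%}")
print(f"median coverage: {statistics.median(depths)}")
```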
You can also see the GC content distribution in the different mapped reads, which is a bit hard to interpret if you don't know what the expected value for the SARS-CoV-2 genome would be. You can see the mapping stats from Samtools stats, so how many reads got aligned, how many mapped in the right orientation, and so on. You can see what the insert sizes for the reads in a pair are, so how large the sequenced genomic fragments were, and you see a nice distribution which is pretty much the same for all of the different samples, so there were no problems in the library preparation. There's also the sequence quality of the reads and how big the effect of filtering was: not a tremendous effect of fastp in this case, because quality was already good originally; even for read 2, which on average is usually a bit worse in quality in Illumina sequencing than the forward reads, filtering doesn't have a very big effect. Good. Most importantly, though, we're really interested in this coverage analysis and in identifying the samples that will be problematic in the further analysis of the data.

Another interesting piece of information is whether the amplicon scheme is really what we think it is. If, for example, the sequencing lab changes the primer scheme they're using for generating the amplicons, and maybe they don't let you as the bioinformatician know about that, it can be interesting to inspect the output of ivar trim, the runs that produce the fully processed reads for variant calling. If we expand those 16 datasets and look at their dataset details, we can see the standard output produced by ivar trim, which tells us what it found. It says it could trim primers from our primer scheme from 33 percent of all reads. This is probably okay; don't expect to find primers on all of those reads, because with a tiled-amplicon scheme the amplicons normally get fragmented again after their generation and are sequenced as these shorter fragments, so not every sequenced fragment has to start in a primer and end in a primer from the primer scheme. That's why only about a third of the sequences needs primer trimming here, and that's also why the remaining 66 percent of reads started outside of primer regions, which does not mean outside of amplicon regions, but most likely inside; we want to keep those reads, because they are just what you'd expect from the fragmented nature of those amplicons during sequencing. What we do want to drop, though, are the reads that really did not fall within an amplicon, because those would be in real contradiction to what we assume based on the amplicon info and the primer scheme. If we find a read, or a pair of reads, that extends to outside of an amplicon, something would be fishy, and if we see this for many reads, then we should worry that maybe we are not assuming the correct primer scheme to begin with. However, there is not much to worry about here: just 0.56 percent of reads, or pairs of reads, did not fall within an amplicon, and that's probably just what's to be expected from sequencing artifacts, or maybe there was a bit of unamplified original genomic material in the sequencing pool. So 0.56 percent is really nothing to worry about, but if this number goes up to several percent or tens of percent, then chances are you're assuming the wrong primer scheme. We could check this for all 16 samples, but if the sequences come from a single batch of prepared samples, it's usually all or nothing: either you are assuming the correct primer scheme or you aren't, so occasionally checking one sample from a batch is totally enough.

Then, of course, the key output of this workflow is these SnpEff-annotated variants; let me find those again. Here they are, the final SnpEff-annotated variants. Those are in VCF format, and if you look inside one of them, it looks like this: in the body of the file you have one line per mutation that the variant caller found, you have lots of quality statistics about the call that the variant caller made, and you have those SnpEff annotations packed down here into one long string.
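Just to illustrate what is hiding in such a line, here is a minimal sketch that pulls the basic call information and the first SnpEff annotation out of one of these per-sample VCFs; the field layout of the ANN string follows the general SnpEff annotation format (allele|effect|impact|gene|...), and the file name is made up:

```python
# Sketch: extract position, alleles, depth, allele frequency and the first
# SnpEff annotation from a LoFreq + SnpEff style VCF.
with open("sample_variants.vcf") as vcf:        # hypothetical file name
    for line in vcf:
        if line.startswith("#"):
            continue                            # skip header lines
        chrom, pos, _id, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
        info_fields = dict(
            item.split("=", 1) if "=" in item else (item, True)
            for item in info.split(";")
        )
        dp = info_fields.get("DP")              # read depth at the site
        af = info_fields.get("AF")              # allele frequency of the call
        ann = info_fields.get("ANN", "")
        first_ann = ann.split(",")[0].split("|") if ann else []
        effect = first_ann[1] if len(first_ann) > 1 else "NA"
        gene = first_ann[3] if len(first_ann) > 3 else "NA"
        print(pos, ref, alt, dp, af, gene, effect)
```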
You can easily imagine that such files are not very nice to browse for a large collection of samples; even with just 16 samples we would not enjoy going through this list, inspecting all this output and trying to make sense of it. That is really the reason why we're now running the second workflow, the reporting workflow, which is currently running: to make this more digestible than the collection of VCFs. That said, this collection of VCFs is important raw output and should be archived and kept, because it has all the details that were ever known about your samples; from this point on we're really going to reduce the information content, so it's good practice to keep those VCFs around should you ever have questions about this batch of data again in the future.

Let's inspect what the reporting workflow has been doing so far. It's making some progress, and we're just waiting for it to finish scheduling so that we can then start the third workflow, the consensus construction workflow. The final job produced by the reporting workflow, the job that will create the variant frequency plot, has just been scheduled, so we no longer risk mixing up datasets produced by different workflows in our history, and we're ready to run the third and final workflow, the consensus construction workflow. By now you should be very familiar with the process of importing it into Galaxy: go to the GA4GH servers, type your query again, that's organization:iwc with the name filter, and we're looking for the consensus workflow this time, and down here it is, consensus construction. We expand it, see the different versions, use the latest, import it into our list of workflows, and then we're ready to run it.

This workflow needs the variant calls again, and again, for tiled-amplicon data, this should not be the strand-bias soft-filtered list of variants but the final SnpEff-annotated variants collection, the one whose VCFs we just inspected. We can leave the workflow parameters at their defaults, but the consensus construction workflow needs two more data inputs to run. One is the aligned reads data for depth calculation. It needs this because the workflow will try to mask out positions in the generated consensus genomes where a lack of coverage simply didn't allow the calling of variants: because of a lack of coverage, certain positions in the generated consensus genome might just be unknown or very uncertain, and to indicate this uncertainty the workflow will incorporate Ns into the consensus sequences at these sites, instead of just guessing and putting some reference or alternate nucleotide in there. So what it needs are our fully processed BAMs that went into variant calling in the first workflow, and that's this dataset, the fully processed reads for variant calling, with primers trimmed, realignment done and indel qualities added; this serves as the input. We can leave the depth threshold for masking at its default. And then we need the reference genome again, because this workflow will try to incorporate the called mutations found by the variation analysis workflow into this reference to produce per-sample consensus genomes. Okay, we've set things up, so we can run it, and, as before, the new datasets will appear step by step in our history while we just wait for the final results.
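To make the N-masking idea just described concrete, here is a purely conceptual sketch; the actual workflow uses dedicated tools for incorporating variants and masking, and the depth threshold of 5 and all file names here are assumptions for illustration only (variant incorporation itself, substitutions and indels, is left out):

```python
# Conceptual sketch: mask every reference position whose read depth falls
# below a threshold with an N, instead of guessing a base there.
DEPTH_THRESHOLD = 5                               # assumed, not the workflow default

# reference sequence without the FASTA header line
reference = list(
    open("NC_045512.2.fasta").read().split("\n", 1)[1].replace("\n", "")
)

# per-position depths, e.g. parsed from `samtools depth -a` output
depths = [int(line.split("\t")[2]) for line in open("sample_depth.tsv")]

consensus = [
    base if depth >= DEPTH_THRESHOLD else "N"
    for base, depth in zip(reference, depths)
]
print(">consensus_with_low_coverage_masked")
print("".join(consensus))
```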
In the meantime, because they are basically ready by now, we can look at the outputs of the reporting workflow, which are the three topmost datasets here. The variant frequency plot will hopefully be generated in an instant, but the two reports are there already. The format of these reports is described in a lot of detail in the written tutorial, so we really encourage you to work through that at the same time and read about the format of these reports; what I want to show here is really that these reports turn the VCFs into a much more consumable format for us humans.

The combined variant report by sample basically takes all the variant calls from all the samples and aggregates them into one big list. It has a first column that indicates the sample for which the variants are listed, so it first lists all the variants of the first sample, then further down all the variants of the second sample, then of the third, and so on and so forth. What you can see is that the very nested long string that was initially in the INFO column of the VCFs is now extracted into a nice, simple tabular format. So we can find the reference sequence allele at a given position, 241 in this case, and what the variant caller found instead at this position, which depth of reads was available to make this decision, and what the allele frequency was, so 99.7 percent of those 2586 reads confirmed this mutated allele. We see what the impact of this position would be according to SnpEff (241 is in the leader sequence, so it doesn't affect anything in terms of coding regions), which gene got affected, and the transcript identifier, the name of the translated product, and this is where the feature mapping file got used, because these NCBI RefSeq identifiers have now been translated into the nicer names. There is also information about which other samples this variant was found in, and so on.

More condensed is the combined variant report by variant. This aggregates across samples and has only one line per specific variant that was discovered, so it doesn't repeat itself but goes only once through the SARS-CoV-2 genome, lists all the variants and their effects at the codon, amino acid and transcript level, and then lists, back here, the samples in which each particular mutation was observed. Right, so these are the reports.

Now the variant frequency plot is ready, and if we go into it we'll see a nice overview plot; we can get rid of the tool bar here to see a bit more. We can see, visually separated from each other, the three clusters we asked for: apparently there is one bigger cluster with lots of seemingly very similar samples inside it, and then two smaller clusters, one consisting of only two samples, one of three, that are relatively different from the main cluster. Each row here is one sample, as you probably already guessed, and the colorful top row indicates the different genes in the SARS-CoV-2 reference genome. Each cell is a mutated position, a position that is mutated in at least one of the samples in the batch; not all positions of the genome are shown, only the ones affected by a mutation in at least one of the samples, which makes the plot a bit more concise. The colors of the cells indicate the observed allele frequency, so which fraction of reads confirmed a given mutation in a particular sample, and if that fraction is low, as here, that's always something to inspect. For sample number two, for example, there are lots of light colors, meaning not all reads covering these apparently mutated positions in that sample agreed on the mutation; there were other reads that didn't confirm it.
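If you are curious how such a plot relates to the by-sample report, here is a minimal sketch that builds the same kind of sample-by-position allele frequency matrix and clusters it hierarchically into three groups; the column names (Sample, POS, AF) are assumptions about the report and should be checked against its actual header line:

```python
# Sketch: sample-by-position allele frequency matrix, hierarchically
# clustered into three main clusters, as in the variant frequency plot.
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

report = pd.read_csv("combined_variant_report_by_sample.tsv", sep="\t")

# rows: samples, columns: mutated positions, values: allele frequency
af_matrix = (
    report.pivot_table(index="Sample", columns="POS", values="AF", aggfunc="max")
    .fillna(0.0)
)

tree = linkage(af_matrix.values, method="average")
clusters = fcluster(tree, t=3, criterion="maxclust")
for sample, cluster in zip(af_matrix.index, clusters):
    print(sample, cluster)
```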
The tutorial goes into detail on how to make sense of all this data, and I don't want to show the interpretation of the data here; I just want to give you a quick tour of what's available and how you can browse it. Because the sample names and the mutation names down here are really hard to read in the overview, you can use the zoom function of your browser to dive in more closely; this is an SVG file, so it will scale as much as you want, you just need to move around. Now you can read the labels of the different cells, so which mutations they refer to, you have access to the sample names, and you can compare this to the other reports.

In the meantime you can see that the consensus workflow is making good progress, at least in terms of scheduling; some jobs are also running, so we just wait for it to finish. The end result will be a collection of consensus sequences in FASTA format, one per sample, and, for convenience when running additional tools like Nextclade and Pangolin for lineage assignment, all these FASTAs are also joined into one multi-sample consensus FASTA with all sequences in one file. That file will then serve as the input to those lineage assignment tools, and once this dataset is ready I'll show you how to run these tools on top of the consensus workflow.

All right, that has worked nicely: all our jobs are finished, the two final results, the joined file and the collection, are both green, and inside the collection all 16 individual FASTA files are green too, so we're ready to proceed with lineage assignment. For lineage assignment there exist two widely used tools, and Galaxy offers both for your own analyses. One of them is Pangolin, and it's relatively straightforward to run through Galaxy, which saves you a lot of the installation hassle around Pangolin and its assignment data that you would normally experience when trying to do this from the command line. It asks for a, possibly multi-sample, consensus FASTA file; that file should contain all the consensus sequences you want to have lineages assigned for, and that's the topmost file just produced by the consensus workflow. Pangolin, up to this version 4.2, still supports two analysis modes, UShER or pangoLEARN; UShER is the de facto standard nowadays and pangoLEARN is more or less unmaintained, so just stick with UShER; pangoLEARN will probably go away in upcoming releases of Pangolin.

Then you have a couple of options for where the pangolin-data could come from. To assign lineages, Pangolin needs knowledge about all previously defined lineages, and this comes in the form of a pangolin-data package, and you have various options here. You can use the pangolin-data version that was shipping with this version of Pangolin when it got released, which would be pangolin-data version v1.17 in our case. You can choose a specific pangolin-data version cached on your Galaxy server; that is, if the Galaxy admins have installed different versions of the package, you can use those, and on usegalaxy.eu you find lots of them. This is nice because, for reproducibility, it also gives you access to previous releases of the data, but as you can see, the latest version currently installed on usegalaxy.eu is v1.17, the same one that was shipping with the tool, so we can just go with the simpler option.
Less recommended is "Download latest available pangolin-data version from web". This pulls down the very latest release of the data from GitHub, where the data is hosted. That works, but it hinders reproducibility, because when you run the tool with the same setting again it might pull in different pangolin-data. It is really only recommended if you have a brand-new batch of data that you suspect might contain the very latest lineages defined by the Pango team, and you really need the latest pangolin-data to assign those lineages correctly. For our purposes, this is data from one and a half years ago already, the very early batch of Omicron data we're analyzing here, so the pangolin-data version that was shipping with the tool when it got released is definitely all we need. The same goes for Constellations; that's a package Pangolin uses under the hood for the assignment, and it's just different in that it gets updated less frequently, so its last update was a while ago and the version shipping with the tool is still the current one, which is fine. We do not need to change any of the default settings, except that we might want a nice header line in our output so that we know what all the different columns in the Pangolin report mean. So we have an input file, we have an analysis mode, and we have included a header line, and that's it; we run the tool, and that's really all we need to do.

The competitor of Pangolin, or rather the complementary tool to use, is Nextclade. Nextclade works quite similarly: it also asks for a, possibly multi-sample, consensus FASTA file and will assign lineages to all the sequences found in that file. Nextclade also has support for some other viruses, with limited support for monkeypox and for certain strains of influenza A and influenza B viruses, so you have to choose the organism, and the default choice, SARS-CoV-2, is the right one here. You have the same option to use database versions cached on the Galaxy server or a version freshly downloaded from the web; in this case, just to demo it, I'm going to use the "download latest available database version from web" option, although it goes a bit against the reproducibility idea. We want the tabular format report, which is the equivalent of the Pangolin report. There would be more options here for Nextclade: you can, for example, output the aligned sequences, so all the sequences in your batch in a multiple sequence alignment, you can ask for the report in JSON format instead of tabular, which is more machine-parsable, and you could also get a tree file in JSON format, but the tabular report is all we want for this tutorial. Like for Pangolin, we want to include the header line in the output, and then we're done.

The output of these two tools is discussed quite nicely in the tutorial, so again, go through all the details there and try to make sense of the things reported in those files. The tutorial goes a bit further and creates aggregated reports out of these two files using a tool called Datamash. I'm not going to show you those steps here, and we're also not going to inspect the outputs much; it's all described very well in the tutorial.
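If you ever want to do a similar aggregation outside of Galaxy, a minimal sketch of joining the two reports per sequence might look like this; the column names (taxon, lineage, seqName, clade) reflect typical Pangolin and Nextclade output but should be checked against your actual header lines, and the file names are made up:

```python
# Sketch: join the Pangolin and Nextclade reports on the sequence name so
# that lineage and clade calls can be compared side by side per sample.
import pandas as pd

pangolin = pd.read_csv("pangolin_report.csv")                # Pangolin writes CSV
nextclade = pd.read_csv("nextclade_report.tsv", sep="\t")    # we asked for tabular

merged = pangolin.merge(
    nextclade, left_on="taxon", right_on="seqName", how="outer"
)
print(merged[["taxon", "lineage", "clade"]])
```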
Just be careful here: there are two Datamash versions installed on usegalaxy.eu, and if you click on them you will see that one is 1.8 and the other 1.0.6, a quite outdated version, so be sure you pick the 1.8 version when running Datamash. This will let you condense these outputs of Nextclade and Pangolin further for a nice comparison. With that, I'm at the end of this brief demo. I wish you good luck following this tutorial and producing the same kind of outputs that I'm still producing here right now. I hope you found it instructive. Have a nice day, bye-bye!