Welcome to day three of this workshop. My name is Wolfgang Mayer from the European Galaxy team, and I'm now going to give you an overview of what we have prepared for you today, which is actually quite a program, I think, and from a Galaxy-centric point of view probably the highlight of this workshop.

Looking back at the last two days, these were more or less a general introduction to Galaxy, and to Galaxy as a platform for sequence data analysis. You have hopefully learned by now to work with Galaxy productively: to upload your data, work with different histories, run analysis jobs and inspect results. So that was the Galaxy 101 part of it so far. You've learned about collections, which are Galaxy's fundamental building block when it comes to processing many samples in parallel, and which are therefore quite central, as you will learn today, to the analysis of SARS-CoV-2 sequencing data. You've heard about concepts of NGS data analysis and worked through several concrete tutorials about read mapping and quality control, and, as a first application of these concepts to viral sequencing data, you learned how to remove human reads from viral sequence samples with Galaxy. All along the way you've encountered a good number of different Galaxy tools already, so hopefully you feel a bit confident already to combine them into more complex analyses.

Still, so far we could have taught all these topics in any other Galaxy workshop as well. That is what's going to change today, because today it's all about SARS-CoV-2 sequence analysis, and SARS-CoV-2 sequence analysis, as it's done all over the world right now, has two outstanding characteristics that I want to explain next.
First of all, the SARS-CoV-2 sequencing data currently available is all rather similar, in that there are, at least if you're not getting lost in details, just very few protocols for sequencing that viral genome. Essentially, you have whole-genome sequencing approaches, which need a larger amount of starting material but can provide unbiased data across the viral genome, and you have the ampliconic approaches, in which the viral RNA, or rather its reverse-transcribed DNA, first gets amplified with specific primer sets so that the combined PCR amplicons together cover the whole genome again. This latter approach is of course the dominating one in the current phase of this pandemic, because it works with really low amounts of starting material, like the amounts obtained with diagnostic swabs for regular PCR diagnosis. In terms of sequencing platforms, there are really just two worth mentioning: Illumina sequencing on the one hand and Oxford Nanopore Technologies (ONT) sequencing on the other. Of these, Illumina sequencing comes in two flavors, single-end (SE) and paired-end sequencing.

The second characteristic, which is kind of the opposite of this small number of platforms, technologies and protocols, is the sheer amount of SARS-CoV-2 sequencing data that's available. No other nucleic acid in history has ever been sequenced as obsessively as this tiny viral genome. By now there are several hundred thousand sequencing datasets available through public databases, and that number keeps growing rapidly every day.

Combined, this relative uniformity of the data, which comes from the low number of data sources, and its excessive amount really beg for automation. What we need is a robust, reproducible and ideally agreed-upon way to analyze the information that's in all of this data, and very clearly that cannot be achieved by any by-hand analysis, no matter
which platform you use. What that means in Galaxy terms is that you absolutely need workflows instead of manually combining tools to achieve a full analysis. These workflows should also process data in parallel, in batches, simply because hardly anyone sequences just a single viral sample at a time anymore.

But even workflows do not provide enough automation anymore for the amount of data that we're seeing produced with SARS-CoV-2 genome sequencing nowadays, because somebody still needs to run all these workflows and feed the data into them. So the next step is clearly to automate even these workflow runs, that is, to launch them automatically as new data comes in, and this is something you can achieve via Galaxy's API. You'll learn today about command-line tools like Planemo, and the underlying library BioBlend, that allow you to launch workflows with a simple command-line call.

So that's what you'll hear about in the first two hands-on parts of today: workflows for SARS-CoV-2 variant calling, generation of batch-level reports and consensus building, that is, generation of viral consensus sequences; and using the Galaxy API to automate such a workflow run from the command line. Then, in the last part of today, you'll see how you can combine these two things and in this way turn Galaxy into a really powerful SARS-CoV-2 sequence analysis and genome surveillance platform.

With that, let's take a quick look at the individual topics for today. The recommendation would be that you start with this hands-on material here about variant calling, reporting and consensus sequence construction with Galaxy workflows. This material is still quite similar in style to the ones you've encountered so far and will demonstrate the usage of the best-practice workflows for SARS-CoV-2 sequence analysis that have been developed over the last year or so by the COVID-19 Galaxy project. So here's a schematic overview of
what awaits you there. The different kinds of input data, so Illumina- or Oxford Nanopore-sequenced whole-genome sequencing or ampliconic data, are reflected in several flavors of a variant calling workflow that takes sequenced reads and produces variant calls, one file per sample, in VCF format. All these variations of the workflow are entirely collection-based to enable efficient parallel processing of batches of input data. Downstream of them you have unified processing of these per-sample VCF files by just one reporting workflow and one consensus building workflow. These are agnostic to the type of input data: they just see the VCFs produced by any of the upstream variant calling workflows and can work with all of them. Of these, the reporting workflow turns per-sample variant information into batch-level reports, so it aggregates the results and produces summary plots for them, and the consensus building workflow builds high-quality consensus FASTA sequences out of prefiltered per-sample variant calls. These consensus sequences you can then use as input for, for example, viral lineage assignment.

So this first tutorial will walk you through all of this. Importantly, the variant calling workflows that you will use right at the beginning of this tutorial will run for a rather long time, because we'll make you work with real sequencing data here. So be prepared for a break of up to maybe three hours while the per-sample variant calls get produced. In the meantime, what you could do, for example, is to already start exploring the second tutorial of today.

This one is quite special among our hands-on material, because in it you learn how to run a Galaxy workflow not through Galaxy's graphical user interface but from the command line instead. Specifically, you'll use the command-line package Planemo to run a Galaxy workflow, namely one for consensus building
and viral lineage assignment from already called variants in the form of a collection of VCF files. So in the second tutorial you'll be recapitulating, through the command line, a part of what you will be doing in the graphical user interface in the first tutorial.

This tutorial, however, is definitely going to be easier to follow if you have at least some very basic command-line experience already. Also, if you want to do the hands-on part of it, you will need access to a system with a proper terminal and you will need to install Planemo on that system. For Windows users, that means you will have to install the Windows Subsystem for Linux, WSL 2, which you'll find is easy enough if you're on Windows 10, and which will then let you use a Linux terminal from inside Windows. But be prepared that this might take a bit of extra time to set up. If that's all a bit too daunting for you at the moment, you can of course also just watch the video and see whether this is generally interesting for you. If you find that it is, note that all the material of this workshop is going to stay online, so you can always come back to it later when you have a bit more time, and do this in a quiet hour or two on some other day.

Then, in the final demo for today, you will see how you can combine the best-practice workflows from the first tutorial with Planemo and scripts that we have developed this year to really turn a Galaxy server into a SARS-CoV-2 sequence analysis and genome surveillance platform with high automated throughput. You might actually be surprised to learn how easy, or rather easy, it is to establish this part of the figure here.

This last topic also already touches a bit on the topics of the last day, tomorrow, when we will cover a lot of the ecosystem around Galaxy and some projects we're collaborating with to bring SARS-CoV-2 data and its analysis to as many interested people as possible. This involves, for example,
the ENA, the European Nucleotide Archive, to which you can now upload your sequencing data directly from inside Galaxy. We'll also cover several downstream databases and dashboards for which our Galaxy analysis results actually serve as input. But as I said, this is the topic of day four, so I'm not going to steal more of your time from today. I hope you get the most out of today's material, and have fun.
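
As a rough illustration of the command-line automation described in this session, here is a minimal Python sketch that assembles a `planemo run` call against an existing Galaxy server. It is not taken from the workshop material: the workflow file name, job file name, server URL, API key placeholder and history name are all hypothetical, and the sketch only prints the command unless you explicitly ask it to execute.

```python
import shlex
import subprocess


def build_planemo_command(workflow_path, job_path, galaxy_url, api_key, history_name):
    """Assemble the argument list for running a workflow on an existing Galaxy server."""
    return [
        "planemo", "run", workflow_path, job_path,
        "--engine", "external_galaxy",    # use a running server instead of a local test instance
        "--galaxy_url", galaxy_url,
        "--galaxy_user_key", api_key,
        "--history_name", history_name,   # name of the history that receives the results
    ]


def launch(command, dry_run=True):
    """Print the command; execute it only when dry_run is False and Planemo is installed."""
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)


# All values below are placeholders for illustration only.
cmd = build_planemo_command(
    workflow_path="consensus.ga",          # hypothetical exported workflow file
    job_path="consensus-job.yml",          # hypothetical job file describing the inputs
    galaxy_url="https://galaxy.example.org",
    api_key="YOUR_API_KEY",
    history_name="Consensus batch 1",
)
launch(cmd, dry_run=True)
```

Keeping command construction separate from execution makes it easy to check what would be launched for each new batch of data before handing control to a scheduler or a watch script.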