Hello, everyone. Welcome to this talk about the use of Galaxy in SARS-CoV-2 sequence data analysis. My name is Wolfgang Maier. I'm working for the Galaxy Europe team, and I have the honor to present today our 2021 update on our genomics efforts within the COVID-19 Galaxy project. This COVID-19 Galaxy project is an open project: anyone with an interest in SARS-CoV-2 data analysis and Galaxy is welcome to join in. It aims at turning Galaxy, and specifically the useGalaxy.* instances, the big public Galaxy services, into a global platform for the analysis of SARS-CoV-2 data, be it within genomics, cheminformatics or proteomics. These are the three big scientific fields that we're currently addressing, and the work involves providing and maintaining tools and workflows for SARS-CoV-2 data analysis in these different fields, spreading knowledge about best practices for such data analysis, providing training material for the community, making sure all these workflows and tools are operable on different public Galaxy instances, and so forth. So what is our motivation in the genomics part of this project? Well, like many people in the Galaxy world, we like to advocate transparent, reproducible and reusable analysis, in this case of SARS-CoV-2 genomic data, and this kind of analysis should be possible for everybody, independent of whether they have their own hardware resources at hand. So basically the whole point about Galaxy, as usual. That was our starting point in 2020, and since late 2020 or early 2021 we have definitely added a second motivation to the list: make sure Galaxy keeps meeting the shifting analysis needs of the field. What do I mean by that? Well, when we kicked off this project in February 2020, before the SARS-CoV-2 pandemic had even really started, people were still interested in assembly of the viral genome.
There were not so many viral isolates from China at that time, and very few from outside China. The SARS-CoV-2 reference sequence was new at that point, so we built an initial workflow for variation analysis, calling variants against that reference, and did different other types of evolutionary analysis, even early selection pressure analysis on the spike protein. But since then, the analysis needs of the world have shifted quite a bit. Illumina whole-genome sequencing was the method of the early days, definitely. This of course has the disadvantage that it requires a lot of starting material, RNA in this case, SARS-CoV-2 being an RNA virus. The typical starting material is either bronchoalveolar lavage fluid from patients, which requires very invasive treatment of the patient, or cell culture, which takes a long time because the virus has to be passaged through multiple generations in culture. So Illumina whole-genome sequencing doesn't really scale to the throughput needs of the world today, when national genome surveillance projects aim at sequencing something like 5% of all SARS-CoV-2 cases in a country. You cannot achieve that with this method. So people have started relying more and more on tiled-amplicon approaches, the most notable of which is certainly the ARTIC Network's approach with their dedicated primer set for SARS-CoV-2. In this approach, you have two pools of primers, with around 50 primer pairs in each of those two pools. You do multiplex PCR, amplifying large parts of the viral genome in each pool, one pool depicted in green in this illustration, one in violet. When you combine these two PCR pools before sequencing, you get reads from all across the viral genome and can actually recover the whole viral genome sequence from this mixed set of tiled amplicons.
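The tiled-amplicon idea can be illustrated with a small sketch. The genome length, amplicon coordinates and overlaps below are entirely made up (they are not the actual ARTIC primer scheme); the point is only that each pool alone leaves gaps, while the two staggered pools together cover every position:

```python
# Hypothetical sketch of a tiled-amplicon scheme: amplicons from two
# alternating pools overlap their neighbors, so the combined pools
# cover the full genome. Coordinates are made up for illustration.

GENOME_LEN = 3000  # toy genome, not the ~30 kb SARS-CoV-2 genome

# (start, end) of each amplicon; neighboring amplicons overlap by 100 bases
amplicons = [(i * 300, i * 300 + 400) for i in range(10)]

# Alternate amplicons between the two PCR pools so that overlapping
# neighbors never sit in the same reaction
pool_green = amplicons[0::2]
pool_violet = amplicons[1::2]

def covered(pools):
    """Return the set of genome positions covered by any amplicon."""
    positions = set()
    for pool in pools:
        for start, end in pool:
            positions.update(range(start, min(end, GENOME_LEN)))
    return positions

# Each pool alone leaves gaps between its own amplicons...
assert len(covered([pool_green])) < GENOME_LEN
# ...but combining both pools recovers every genome position.
assert len(covered([pool_green, pool_violet])) == GENOME_LEN
```

Keeping overlapping amplicons in separate reactions is what prevents neighboring primer pairs from amplifying each other's short overlap regions instead of the intended tiles.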
The big advantage here obviously is that it requires very little starting material, so even diagnostic swabs from patients will do, meaning essentially leftovers of regular routine PCR tests. This increases the throughput of the world's whole-genome sequencing effort tremendously, of course. What that meant for us is that we needed to provide workflows for analyzing this kind of data within Galaxy. The first workflows that we developed were actually targeting Illumina whole-genome sequencing data, either single-end or paired-end data. The basic outline, the schematic view of these workflows, is that we used, and still use, BWA-MEM for mapping of viral reads. We then do variant calling with LoFreq, a very sensitive variant caller that is capable of reliably detecting even low-frequency alleles within viral sequencing data. We then annotate those variants with the special COVID-19 release of SnpEff. To adapt this workflow to ARTIC-amplified data, the essential bit is that you include a tool called ivar, specifically the ivar trim functionality of that suite, into the workflow, because that will remove primer sequences from the amplified starting material. Those primer sequences would never contain variant bases, you will only find wild-type sequence in them, so they could hide true variants in your sequencing data, and removing them is essential. Then, with the advent of the ARTIC protocol for SARS-CoV-2 sequencing, people also started using Oxford Nanopore sequencing technology on such pre-amplified data, so we had to come up with yet another workflow for calling variants from this kind of data. In that workflow we still use ivar trim for the purpose of removing the primers, but we now use Minimap2 as the read aligner and Medaka for variant calling, and again the SnpEff COVID-19 release for the variant annotation.
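Why primer trimming matters can be sketched in a few lines. This is a toy illustration of the idea behind ivar trim, not its actual implementation (real trimming operates on aligned BAM records with CIGAR-aware soft-clipping); the coordinates and the `trim_primer` helper are invented:

```python
# Toy illustration of amplicon primer trimming (the idea behind
# "ivar trim", not its real implementation): clip read bases that
# fall within a known primer binding site, since those bases come
# from the primer oligo and always show wild-type sequence.

def trim_primer(read_start, read_seq, primer_start, primer_end):
    """Remove the prefix of a read that overlaps a forward primer.

    read_start:   0-based reference position where the read begins
    read_seq:     read bases (assumed aligned without indels here)
    primer_start, primer_end: reference interval of the primer
    """
    read_end = read_start + len(read_seq)
    # No overlap with the primer: keep the read unchanged
    if read_end <= primer_start or read_start >= primer_end:
        return read_start, read_seq
    # Clip every read base that lies inside the primer interval
    clip = primer_end - read_start
    return primer_end, read_seq[clip:]

# A read starting right at a primer site (primer spans positions 100-120):
new_start, new_seq = trim_primer(100, "A" * 50, 100, 120)
assert new_start == 120        # the read now starts after the primer
assert len(new_seq) == 30      # the 20 primer-derived bases are gone
```

Without this step, every read from an amplicon would vote "wild type" at its primer positions, drowning out any true variant located there.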
Now of course all these workflows, depicted only schematically here, are real full-blown Galaxy workflows in reality, so the real picture is far more complex. The one shown here is the Illumina paired-end analysis workflow for ARTIC-amplified data, and all of these workflows are relatively complex under the hood, but the schematics are the ones shown on the previous slide. So we now had three different types of workflows for variant calling: one for analyzing Illumina whole-genome sequencing data, another one for analyzing Illumina data that was ARTIC-amplified, and another one for the same kind of ampliconic data but sequenced with Oxford Nanopore technology. What all these workflows have in common is that they try to call as many variants as they reasonably can, so these are very permissive variant-calling workflows that include essentially all possible variants. We then use soft filters in the VCF to flag the most questionable variants, which most users for most analyses might want to get rid of, but the idea is that we keep as much information in those VCF files as possible in order not to discard any data prematurely. Another characteristic of these workflows is that they are entirely collection-based: we analyze data in batches, from raw sequencing reads to these VCF files, to increase throughput on the Galaxy platform. Then, as the next step after these variant-calling workflows, we built two additional workflows, one for variant reporting and one for consensus building, that work on the collections of VCFs produced by the upstream workflows and generate batch-level output for each of those batches. The reporting workflow will generate overview plots across all samples in a batch that let you quickly pick up interesting patterns of variants in the batch as a whole.
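The soft-filtering idea, flag questionable calls instead of deleting them, can be sketched as follows. The thresholds, field names and records here are hypothetical, not the workflows' actual filter expressions:

```python
# Sketch of soft filtering: instead of dropping questionable variant
# calls, write a label into the VCF FILTER column so downstream users
# can decide for themselves. Thresholds and records here are made up.

MIN_AF = 0.05   # hypothetical minimum allele frequency
MIN_DP = 10     # hypothetical minimum read depth

def soft_filter(record):
    """Set FILTER to PASS or to a list of flags; never drop the record."""
    flags = []
    if record["AF"] < MIN_AF:
        flags.append("min_af")
    if record["DP"] < MIN_DP:
        flags.append("min_dp")
    record["FILTER"] = ";".join(flags) if flags else "PASS"
    return record

calls = [
    {"POS": 23403, "AF": 0.98, "DP": 250},  # high-confidence call
    {"POS": 11083, "AF": 0.02, "DP": 8},    # questionable call
]
filtered = [soft_filter(c) for c in calls]

assert filtered[0]["FILTER"] == "PASS"
assert filtered[1]["FILTER"] == "min_af;min_dp"
assert len(filtered) == len(calls)  # nothing was discarded
```

Downstream consumers who only want high-confidence calls can select on `FILTER == "PASS"`, while anyone interested in low-frequency alleles still has the full data.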
It also generates per-batch variant reports, and the consensus-building workflow generates multi-sample consensus FASTA sequences like people need them for genome surveillance projects and for lineage analysis, for example to find out whether samples belong to the Alpha, Beta, Gamma or Delta lineages of the virus. I'm very happy to be able to say today that we got all of these genomic workflows for SARS-CoV-2 data analysis into the newly formed Intergalactic Workflow Commission's GitHub repo, and from there they are pushed to Dockstore as the Intergalactic Workflow Commission's collection of SARS-CoV-2 workflows. They are also pulled into WorkflowHub.eu as another registry for workflows of all kinds. This gives us very high public visibility for these workflows, it allows for proper version control of them, we get some extra annotations for the workflows, and of course we get very good quality reviews from the Intergalactic Workflow Commission. All these developments, being able to push your workflows to the Intergalactic Workflow Commission, to Dockstore and to WorkflowHub to share them with the community, are not only a great benefit for this project but also for anyone else working with Galaxy. So I can only recommend the next session, which will focus entirely on Galaxy workflows, and especially two talks there: one from Marius introducing the Intergalactic Workflow Commission and the other one from Ignacio about WorkflowHub as a registry for workflows. If you're working with Galaxy workflows, these are really talks that you should listen to. But now back to our project. With all of that, we think we position Galaxy quite strategically in a central position for SARS-CoV-2 sequencing data analysis. Whatever your use case, we support it: you take your raw sequencing data, you analyze it within Galaxy up to the level of VCF files, and then you're free to choose.
So you can export these VCF files to downstream variant-analysis software or providers. One we're collaborating especially closely with is the Viral Beacon project, about which I will say a few more words later. You can also run the reporting workflow, generate these overview plots and reports, and then manipulate those further and visualize the data therein with, for example, Jupyter notebooks. What we have been using lately are Observable notebooks, which you can think of as the counterpart of Jupyter notebooks in the JavaScript world. These make it especially easy to build websites that display all the data that you've analyzed, and they are really a great tool that we only recently discovered with the help of our collaborators. Then you can also run the consensus workflow, and this will give you the data that most bigger genome surveillance initiatives are looking for. This is the data you want to submit to the GISAID database, which is considered the go-to database by many in such projects today. You can submit this data to genome surveillance initiatives, and you can do lineage assignment using the Pangolin and Nextclade tools on these FASTA sequences. We now have Pangolin and Nextclade within Galaxy as well, so you don't even have to leave Galaxy anymore to do this lineage assignment. We then visualize across batches in various ways. As I said, we have this collaboration with the Viral Beacon project, which has its own dashboard showing publicly available variant calls for SARS-CoV-2 data. We are happy to be one of the major data providers these days, and we get those visualizations from them for free this way. We also host our own interactive notebook, an Observable notebook in JavaScript as I already mentioned, which lets you explore all our data in great detail. And we're also collaborating with national genome surveillance projects, like the one from Estonia, for example, that then import our VCF and reporting data into their own national dashboard.
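What consensus building boils down to can be sketched with toy data (the reference, variants, coverage values and thresholds below are invented; the real workflow operates on full VCFs and per-position coverage from the BAM files): variants above a frequency threshold are written into the reference, and poorly covered positions are masked with N.

```python
# Toy sketch of consensus-sequence building: apply confidently called
# variants to the reference and N-mask positions with poor coverage.
# Reference, variants and thresholds are invented for illustration.

REF = "ACGTACGTACGT"
MIN_AF = 0.7      # hypothetical frequency needed to enter the consensus
MIN_COV = 5       # hypothetical minimum coverage

variants = [
    {"pos": 3, "alt": "G", "af": 0.95},   # becomes part of the consensus
    {"pos": 7, "alt": "A", "af": 0.30},   # too rare: reference base kept
]
coverage = [20] * len(REF)
coverage[10] = 2  # one badly covered position

def consensus(ref, variants, coverage):
    seq = list(ref)
    for v in variants:
        if v["af"] >= MIN_AF:
            seq[v["pos"]] = v["alt"]
    for i, cov in enumerate(coverage):
        if cov < MIN_COV:
            seq[i] = "N"  # don't pretend we know this base
    return "".join(seq)

assert consensus(REF, variants, coverage) == "ACGGACGTACNT"
```

The resulting per-sample FASTA sequences are what tools like Pangolin and Nextclade consume for lineage assignment.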
Our own dashboard, this Observable notebook reachable under this address, lets you explore variants in really exquisite detail. But most recently our primary focus was on scaling this whole procedure. Why is that? I emphasized that we are analyzing data in Galaxy in batches. For each batch of data, maybe a few hundred samples, you would have to run one of our variant-calling workflows, then the reporting workflow and the consensus workflow if you want the full data analysis. So these are three workflows that you have to launch one after the other. On the other hand, interactive dashboards on the national or global scale expect really massive numbers of samples. Our own Observable dashboard, for example, now lists a bit more than 96,000 samples, probably above 100,000 by the time you're seeing this talk. So obviously there's a gap in there: if you're analyzing batches of a few hundred samples and you want to go up to 100,000 samples, you have to launch quite a few of these workflows, and if you do this by hand, you have a problem. So the idea was to come up with automated variant calling, and we devised a system composed very simply of just two components. One is a series of bot scripts for downloading batches of sequencing data from public sites (right now we're using mostly the European Nucleotide Archive), launching our appropriate workflows (variant calling, reporting and consensus sequence building), and then exporting the data from Galaxy to the second component, a massive FTP server space. The bot scripts that drive the whole thing can be found in these two different GitHub repos. What the bot scripts do is use Galaxy, and they are powered by these super cool libraries Planemo and BioBlend, which really make this task of automating workflow execution in Galaxy quite simple.
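The batching logic at the heart of such bot scripts can be sketched like this. The accession IDs, the batch size and the `run_workflow` placeholder are all hypothetical; the actual bots drive a real Galaxy server through BioBlend and Planemo rather than the stub shown here:

```python
# Sketch of bot-style automation: split a stream of public accessions
# into fixed-size batches and hand each batch to a workflow launcher.
# Accessions and the launcher are hypothetical; the real bots talk to
# Galaxy through BioBlend/Planemo instead of this stub.

def batched(items, size):
    """Yield consecutive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_workflow(batch):
    # Placeholder for a BioBlend call along the lines of
    # GalaxyInstance(...).workflows.invoke_workflow(...)
    return {"inputs": batch, "state": "submitted"}

accessions = [f"SAMPLE{i:05d}" for i in range(1000)]  # made-up IDs
invocations = [run_workflow(b) for b in batched(accessions, 300)]

# 1000 samples at 300 per batch -> 4 workflow launches
assert len(invocations) == 4
assert len(invocations[0]["inputs"]) == 300
assert len(invocations[-1]["inputs"]) == 100
```

The value of the bots is exactly this loop run continuously: what would be hundreds of manual workflow launches becomes a scheduled, unattended process.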
And then we have this FTP server space, the address of which you're seeing here, hosted at the Barcelona Supercomputing Center together with the Centre for Genomic Regulation, which is also the organization that hosts the Viral Beacon project. This collaboration is really great for us because we can now automatically analyze almost as much data as we want on any kind of Galaxy instance, and we can then automatically push the finished analysis data to this FTP server to give everyone easy access. The whole system looks like this: when anyone submits to, for example, the European Nucleotide Archive, or makes data public in any other way, we can retrieve the data in batches through our bot scripts, using the WorkflowHub or Dockstore versions of our workflows, so version-controlled workflows of high quality. We analyze that data within Galaxy, and whatever is ready we push to that FTP server. From there we populate our interactive Observable notebook, the Viral Beacon project populates their dashboard, and any national dashboard is also free to use the data on this FTP server to populate their site. In addition, we provide an extra GitHub bot that allows users to simply open pull requests with URLs of datasets they would like to have analyzed by us, and this triggers an automated execution of the workflow runs as soon as we merge such a pull request. The really cool thing, what I like almost most about all of this, is that it's not one-way. We're not just depositing data on that FTP server; we're also keeping a JSON file there, a very small part of which is shown here, that links every individual sample analyzed and every dataset on that FTP server back to the Galaxy histories on whichever public Galaxy instance processed the data.
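Conceptually, that provenance file is a mapping from sample identifiers back to the Galaxy histories that produced them. A hypothetical miniature version (the field names, sample IDs and history URLs below are invented, not the file's real schema) could be queried like this:

```python
import json

# Hypothetical miniature of the provenance JSON kept on the FTP
# server; real field names, sample IDs and history URLs differ.
provenance_json = """
{
  "SAMPLE00001": {
    "galaxy_instance": "https://usegalaxy.eu",
    "history_url": "https://usegalaxy.eu/histories/view?id=abc123"
  },
  "SAMPLE00002": {
    "galaxy_instance": "https://usegalaxy.org",
    "history_url": "https://usegalaxy.org/histories/view?id=def456"
  }
}
"""

def history_for_sample(provenance, sample_id):
    """Look up the Galaxy history that processed a given sample."""
    return provenance[sample_id]["history_url"]

provenance = json.loads(provenance_json)
url = history_for_sample(provenance, "SAMPLE00002")
assert url.startswith("https://usegalaxy.org")
```

Because every result file points back to a live Galaxy history, anyone can open that history, inspect every intermediate step and rerun the analysis with different parameters.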
So you can look into this file, look for the sample you're interested in, and then find really the complete analysis flow from the raw sequencing data to the final result files that are on the FTP server, and redo everything if you want to change parameters and play around with the data. It really makes this large-scale data very manipulable. Currently this complete system is deployed on useGalaxy.eu. useGalaxy.org has deployed the main branch, so it is analyzing public data in an automated way; the request-based analysis is currently only available on useGalaxy.eu; and the Spanish COVID-19 useGalaxy.es site is just starting to deploy the whole system and analyze data with it. But the whole thing, and that's the important point, is ready to be deployed right now on your instance if you have one, so I highly encourage you to also come to my poster and demo session, or whatever you like to call it, my live demo, within session A of the poster and demo sessions today. With that I want to thank all the people involved in the project. Of course all of this would not be possible without funding for the powerful infrastructure of our public services, so many funding agencies and infrastructure providers have been essential for this work. On the right-hand side I have collected for your convenience all the links that were scattered throughout the presentation and that you might find useful for further reference. Thanks for your attention.