Okay, yeah, a bit of a long title, but I'll show you how all of these components work together to power our SARS-CoV-2 genome surveillance effort in Galaxy. Most of you probably know me; I'm working for the Galaxy Europe team at the University of Freiburg in Germany, and today I'm presenting only one small aspect of all the COVID-19 efforts that we undertook in this project.

The problem I will focus on is one you have probably all seen a lot by now in newspapers and on websites: how do you actually derive these plots you see on the right side here, the SARS-CoV-2 lineage statistics? How do you get to these frequencies of SARS-CoV-2 lineages over time, what data do you use for that, and how do you analyze that data? You're all familiar with this picture by now: wave after wave, new viral lineages appear and take over, and this is probably not going to stop anytime soon. But how is this data generated?

The typical way to do it nowadays is through these tiled-amplicon approaches, where you amplify the reverse-transcribed viral RNA with an overlapping panel of primer pairs. You run a highly multiplexed PCR reaction, in two aliquots, involving all of these primer pairs. When you then combine the PCR products again and sequence them all together, you get sequencing reads derived from all of these amplicons. That is how you manage to get, from a low amount of starting material, a complete genome sequence across the roughly 30-kilobase viral genome. So those are your sequencing reads.

But then there is apparently a long way to go from these sequenced reads to the plots on the right side, and there are many data processing steps. Somehow those receive a lot less public attention than the sequencing protocols and the final result. That's the bioinformatics part, and probably because it's complicated it's not the stuff in the news. But it is, of course, exactly the stuff we want to enable with Galaxy, and these are the kinds of workflows I'm talking about today: going from such amplicons and their sequencing reads to this lineage information.

Even without going into much detail, these are the major steps involved. You want to map those sequenced amplicons to the viral reference genome. You want to do primer trimming, because, as you can see in this image, the sequencing reads carry those PCR primers at their ends, and those primers were designed to match the viral reference genome. So for the next step, when you do variant calling to look for mutations in the viral genome, you need to ignore those primer sequences, because they will always look like reference sequence; that is how they were designed in the first place. Then you do variant calling. I wrote "mutations" here in parentheses because in this field you have viral variants, and that is a bit confusing, because you also have mutations, which we as geneticists call variants in the genome. So the term "variant" is a bit misleading for virologists; here we are still talking about mutations getting called. The last step is to build a so-called viral consensus genome, where you take the mutations that you detected and incorporate them into the original reference sequence to get a FASTA file that is as close as possible to the actual genome of the viral isolate you are surveying.
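[Editor's note: to make those steps a bit more concrete, here is a rough, hypothetical command-line analogue of the per-sample chain just described, driven from Python. This is not the Galaxy workflow presented in the talk; the tool choices simply mirror the ones named later (BWA-MEM, iVar, LoFreq), and all file names, the primer BED file, and the reference copy are placeholders.]

```python
# Illustrative sketch only: map reads, trim primers, call mutations, build a
# consensus. SnpEff annotation and the careful filtering/masking done by the
# real Galaxy workflows are omitted here.
import subprocess

def run(cmd, **kw):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True, **kw)

ref = "NC_045512.2.fasta"            # SARS-CoV-2 reference (assumed local copy)
primers = "ARTIC_primers.bed"        # tiled-amplicon primer scheme (assumed)
r1, r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

# Index the reference once for the mappers/callers below
run(["bwa", "index", ref])
run(["samtools", "faidx", ref])

# 1. Map the amplicon reads to the viral reference genome
with open("sample.sam", "w") as sam:
    run(["bwa", "mem", ref, r1, r2], stdout=sam)
run(["samtools", "sort", "-o", "sample.bam", "sample.sam"])
run(["samtools", "index", "sample.bam"])

# 2. Trim the PCR primers off the read ends so they cannot mask real mutations
run(["ivar", "trim", "-i", "sample.bam", "-b", primers, "-p", "trimmed"])
run(["samtools", "sort", "-o", "trimmed.sorted.bam", "trimmed.bam"])
run(["samtools", "index", "trimmed.sorted.bam"])

# 3. Call mutations ("variants" in the genetics sense)
run(["lofreq", "call", "-f", ref, "-o", "variants.vcf", "trimmed.sorted.bam"])

# 4. Incorporate the called mutations into the reference to get a consensus
run(["bgzip", "-f", "variants.vcf"])
run(["bcftools", "index", "variants.vcf.gz"])
with open("consensus.fasta", "w") as fa:
    run(["bcftools", "consensus", "-f", ref, "variants.vcf.gz"], stdout=fa)
```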
And then tools like Pangolin or Nextclade can take such a consensus genome sequence and do lineage assignment for it. They can tell you: looking at this FASTA sequence and all the sequences seen before, this seems to be BA.5 or very close to it, this one is BA.4, this is something novel, and so on. That's the kind of thing we want to do with our workflows.

It is quite important to have standardized workflows here, because the hallmark of all these steps is that they reduce information content. You start out with dozens or even a hundred megabytes of sequencing information, but then you basically throw it all away during the data analysis. The result in the end is just the ~30 kb consensus genome for your viral isolate, and the lineage assignment then reduces even that to a short string as a label, like BA.5. If you are not careful to document all these processing steps, you just have some final result, and it is very hard to say whether it is meaningful or has been derived in a good-practice way. That is why we thought this is something where Galaxy really shines: making these data processing steps reproducible.

So what did we come up with as a solution? We gave the whole problem several iterations of thought and finally arrived at a scheme where we provide a couple of workflows. Some of them are upstream workflows dealing only with the mutation (variant) calling problem. We have three of them for different kinds of input data: for amplicon data from this tiled-amplicon approach — the most popular one being the ARTIC primer scheme — sequenced with Illumina, for the same thing sequenced with Oxford Nanopore technology, and for calling mutations from Illumina whole-genome sequencing data, where you do not do this pre-amplification step with a primer scheme. All of these upstream workflows have in common that they produce mutation calls in standard VCF format, and we also annotate these variants with SnpEff so that we know which functional effects on the virus they would have.

Then we have additional downstream workflows: one that does reporting on the variant spectrum found in a batch of isolates, and the consensus-building workflow that actually builds the consensus genomes I have been introducing, which are the input for Pangolin or Nextclade lineage assignment.

Why did we break the thing up into so many different workflows? The reason is that not every user of these workflows will want to do the whole thing. There are users who are just happy with producing a VCF of what is there. There are users who want to go all the way to the consensus sequences to feed them into Pangolin or Nextclade. And there are still other users who want to analyze the variant spectrum further but find VCF too complicated for their scripts to parse, so they want tabular reports; as part of that reporting workflow we also produce overview plots of all variants and their observed allele frequencies across a batch of viral samples. So users can combine exactly the components they need in their analysis, and it also gives us more flexibility when we update these workflows. Say we find a bug in our consensus-building workflow, which has actually happened in the past.
In that case, we only need to modify that one workflow, and when we reanalyze data that we previously analyzed the wrong way — or let's say in a suboptimal way — we do not have to run the whole expensive analysis again; we only need to rerun the changed consensus-building workflow on the outputs of the upstream workflows. So it also speeds up reanalysis.

Okay, and this is what the upstream workflows look like. For mapping in the Illumina case we use BWA-MEM, we use LoFreq for variant calling, then SnpEff for variant annotation and the iVar suite of tools for the primer trimming. In the case of ONT data we use minimap2 as the mapper and medaka as the variant caller; otherwise it is pretty similar. Then come the downstream workflows, which end in these tabular variant reports per batch and in a multi-FASTA consensus file for all isolates that went through this run of the workflow as a collection.

We put all of these workflows through the IWC, the Intergalactic Workflow Commission, onto Dockstore and WorkflowHub, and this makes the whole thing highly reproducible. Any users interested in these workflows can go to Dockstore, for example, import a workflow from there into the Galaxy instance of their choice, and run it at a defined version, at a defined release. This is actually quite convenient now: on Dockstore you can select one of our workflows and get a list of all the different releases we have for that particular workflow. You can say, I want to run this on one of the big three Galaxy instances, or, I have a public URL to my own preferred instance and I want to launch this workflow there. So you choose usegalaxy.eu, for example, you are taken to the Galaxy Europe web interface, and there you see the same versions listed. You pick one for importing, it lands in your list of workflows, and you can start working with it. Conveniently, you also get this nice little green check mark that tells you this is exactly the released version of this workflow from Dockstore and nobody has changed it. If I now open this workflow in the workflow editor, make some changes and save it again, that check mark disappears, and I know: careful, this is a modified version, not the one that is officially released by the IWC through Dockstore anymore.
So this is there to ensure reproducibility, and it all works nicely, but now we have a new kind of problem: we have all these different workflows to run. If you think of SARS-CoV-2 genome surveillance, the issue is that we are not talking about analyzing a batch of a hundred samples or so and being done — that could nicely be handled in one workflow run using collections — but about analyzing hundreds of thousands of samples as they come in and their data gets deposited as raw sequencing reads in public repositories. To scale this, you really do not want to manually execute three workflows in a row for each new batch of data that comes in, and you could not really keep track of all the outputs that way either. So we had to come up with some automation of the whole thing, and this is where Planemo and BioBlend — and, as you will see, also the tags — come into play.

If you do not know how to use the API through Planemo to execute workflows from the command line, you can go to this tutorial, mostly written by Simon last year, which shows you the basics of using `planemo run` with a so-called job YAML file. The essence is: you write a YAML file in which you declare the input datasets of the workflow and possibly workflow parameters, and then you call `planemo run`, give it the workflow ID of your workflow on your Galaxy instance, pass in the Galaxy URL, your Galaxy API key and the YAML file, and tell Planemo to run this workflow with these input files.

For automation we make use of — and maybe slightly abuse — that system, in that we put comment lines into a template version of such a job YAML file. We put lots of information into these comment lines: where to run the workflow, what the workflow ID on that server is, what the collection built in the first step should be called, what the history created during the workflow run should be named, and templates for the datasets that our scripts will download. We have a script that can download SARS-CoV-2 sequencing data — many samples at once, a batch of data — from a public repository like the ENA, and it uses BioBlend to download that data in a structured way, as collections, into a Galaxy history. That BioBlend script, which downloads all the data and builds the collections, then substitutes the actual dataset IDs and the batch name for the analysis (which it takes from the batch identifier at the ENA) for the template variables in those comment lines. The actual job YAML part that `planemo run` sees contains where the general input files, like the SARS-CoV-2 reference sequence, live — their IDs on the Galaxy server, and so on — and after the data has been downloaded, the downloading script extends this template file with the collection information about where the raw reads now live on Galaxy, once it knows that.

So we start out with a template file, we call our BioBlend-based script that does all the downloads for us, and that script in the end fills all the missing information into the template. The result is then passed as the job YAML file to `planemo run`, which executes the analysis workflow for that batch. In the end we have a couple of batch scripts that are used for launching the variation workflow, the reporting workflow, and the consensus-building workflow.
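[Editor's note: below is a minimal, hypothetical sketch of this templated job-YAML mechanism, not the actual scripts from the repository mentioned later. The workflow ID, API key, input labels, and dataset/collection IDs are placeholders, and the exact job-file syntax for referencing datasets that already exist on the server (`galaxy_id` here) is an assumption — check the Planemo documentation or the GTN tutorial for the authoritative format.]

```python
import subprocess
import yaml  # PyYAML

GALAXY_URL = "https://usegalaxy.eu"
API_KEY = "***"                       # your Galaxy API key
WORKFLOW_ID = "abc123"                # ID of the imported IWC workflow (placeholder)

# Static part of the job YAML: inputs that do not change between batches,
# e.g. the SARS-CoV-2 reference sequence already living on the server.
job_template = {
    "Reference FASTA": {"class": "File", "galaxy_id": "ref_dataset_id"},
    "ARTIC primer BED": {"class": "File", "galaxy_id": "primer_dataset_id"},
}

# Pretend the BioBlend download script has just finished: it hands back the
# batch name (from the ENA batch identifier) and the collection it built.
batch_name = "ena_batch_2022_09_28"
collection_id = "collection_id_from_bioblend"

job = dict(job_template)
job["Paired-end reads"] = {"class": "Collection", "galaxy_id": collection_id}

with open("job.yml", "w") as fh:
    fh.write(yaml.safe_dump(job))

# Hand the completed job file to planemo run against the external Galaxy server.
subprocess.run(
    [
        "planemo", "run", WORKFLOW_ID, "job.yml",
        "--engine", "external_galaxy",
        "--galaxy_url", GALAXY_URL,
        "--galaxy_user_key", API_KEY,
        "--history_name", f"Variation analysis of {batch_name}",
    ],
    check=True,
)
```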
There is also one more script for exporting results to an external FTP server. All of these scripts start out from such template files and, after doing certain housekeeping, in the end pass the final job YAML file to Planemo to execute the workflow they are responsible for. This is all public; it lives in the usegalaxy-eu COG-UK workflows repository, so you can have a more detailed look at it if you are interested.

Essentially, we then run these batch scripts as cron jobs — or, in our case, on a Jenkins server — at regular times, and they just grab the latest input data, download it onto usegalaxy.eu or usegalaxy.org, launch workflows with `planemo run`, and wait until the results are there.

And now come the tags, the last thing I have not explained yet. Since these are just cron jobs or Jenkins runs, how do we record the state of what has been done — which data has been processed, which has already gone through the variation workflow, which still needs consensus runs, and so on? We use history tags for that. Whenever one of these scripts runs, it puts a specific tag on the history it creates. For example, when the variation script runs and launches the variation workflow, it generates a new history and puts tags on it saying "bot-go-consensus" and "bot-go-report", informing the other scripts that, the next time they run, this is a history holding the input data for the reporting workflow or the consensus workflow. Once those workflows have run and their batch scripts are complete, they change these tags to "consensus-bot-ok" and "report-bot-ok", and the next runs of the downstream scripts know that this history has already been processed and does not need processing anymore. Okay, it's a bit complicated — feel free to ask questions about it afterwards if you want.

With this system we have built a powerful automated setup where we regularly download the latest data from the ENA for selected national SARS-CoV-2 genome surveillance projects, take the workflows from WorkflowHub and Dockstore at defined releases, run them on the data automatically through these batch scripts, and then export — I showed you that one export script — key result files like BAMs, VCFs, and FASTA files to a Spanish FTP server at the Barcelona Supercomputing Center. From there, our Galaxy data acts as a data source for downstream projects: for the Viral Beacon project at the CRG in Spain, for the UCSC Genome Browser, where we nowadays have a track for our Galaxy data, and for our own interactive Observable notebook, where we show our data interactively to anyone interested — a dashboard, like many national genome surveillance projects have.

This is also where the modularity of things comes into play again, because these different downstream consumers expect different files. Viral Beacon, for example, is mostly interested in our VCF files, and we have them already from the first workflow run. The reports are interesting for the Observable dashboard and for the UCSC Genome Browser, both of which do not want to deal with the full information in the VCF but want a clean tabular report of things. These workflows are also used, for example, by the national genome surveillance effort in Estonia, and they are interested in the consensus FASTAs to run Nextstrain or Pangolin on them.
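[Editor's note: coming back to the history-tag bookkeeping described above, here is a minimal BioBlend sketch of how a downstream bot script could find histories that are ready for it and mark them as done afterwards. The tag names follow the ones mentioned in the talk, but the exact spelling and the `launch_consensus_workflow()` helper are placeholders, not the real scripts.]

```python
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance("https://usegalaxy.eu", key="***")

TODO_TAG = "bot-go-consensus"   # set by the variation script when it creates the history
DONE_TAG = "consensus-bot-ok"   # set by this script once the consensus run has been handled

def launch_consensus_workflow(history_id):
    """Placeholder for the real logic (fill in the job YAML, call planemo run, ...)."""
    print(f"would launch the consensus workflow on history {history_id}")

for hist in gi.histories.get_histories():
    details = gi.histories.show_history(hist["id"])
    tags = details.get("tags", [])
    if TODO_TAG not in tags:
        continue  # nothing for this script to do in this history
    launch_consensus_workflow(hist["id"])
    # Swap the to-do tag for a done tag so the next cron/Jenkins run skips this history.
    new_tags = [t for t in tags if t != TODO_TAG] + [DONE_TAG]
    gi.histories.update_history(hist["id"], tags=new_tags)
```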
Running Pangolin or Nextstrain on them is, by the way, something you can also do inside Galaxy. So: different users, different needs, and that is why this modularity is important.

In total we have now analyzed about 400,000 samples, and all the information about them is on the Spanish FTP server; the address is given there. Two files of importance are the all-variants file, which is basically a tabular report of all the findings we have ever made, and the gx-surveillance JSON file, which is interesting if you really want to take a deep dive into the data, because it encodes, in JSON format, which batch each ENA sample has been analyzed in and where the analysis histories for that particular batch live on public servers. For example, for one reporting workflow run you can see that it is a certain history on usegalaxy.eu, and you can then actually go to that history and inspect all the intermediate data and how it was produced at runtime. So if you are interested in very detailed questions, you can follow this up, and it is also great for us: if we find errors in our analysis later on, we know which data has been analyzed with which workflow release, and we can reanalyze certain data and compare the differences between analysis runs. So this file is quite important, if very specialized, and I think no other project has this information trackable in this kind of detail.

Then, the outlook: currently we are interested in expanding this and making it compatible with other pathogens. SARS-CoV-2 is certainly not the last pathogen of concern that will come around; there are many candidates for triggering a new pandemic at some point. So we want to maintain our existing workflows, keep improving them as new versions of the analysis tools are released, and adapt the workflows to other pathogens of concern. That is not really super difficult to do, but every virus is a bit different in its genomics — they all have highly specialized genomes — and sometimes you just need to tweak the analysis a little to cope well with one particular genome. And of course — the big public Galaxy instances are probably fine — virology departments also need to maintain their infrastructure, invest in Galaxy, and get familiar with how the whole framework works. They hopefully still have a couple of years to prepare, but that is something we need to promote.

With that, I want to thank all the people involved — a lot of people, of course, it is a big project. Anton most of all, who was constantly pushing this project further, for assigning me to it and always being helpful along the way; Simon, who did a great job setting up the automation and getting things merged upstream into Planemo to make it do exactly what we wanted; Marius, for constant technical help and for all the great investment into the IWC. I will not go through the whole list, but really, thank you to everyone involved — also to John, of course, for building Planemo, and to Nicola for having created and maintaining BioBlend; those are very essential components. And if you are interested in anything I said in this talk, here are all the links on one slide. Everything we have created is open source and public, so explore. That's it — I'm happy to take questions.

Thank you so much for that presentation — that's awesome work. Are there any questions? Oh, good.
If I may — right now COG-UK must have slowed down quite a bit, right? So how do you pick what to analyze now? I mean, we had this discussion about perhaps pulling the plug on this, but I actually think we should continue with a selected set of high-quality samples. So how do you pick the right samples right now?

Yes — for a long time, until this summer, we actually had to just randomly sub-sample the COG-UK data, because at peak times they were submitting something like 20,000 samples a week, which is just crazy to keep up with and probably also not worth analyzing in full. So I was essentially doing a random sub-sample down to a fifth of the data and analyzing just that; it still gives you a representative snapshot of what is going on. Now we could slowly revert, if we wanted, to following and analyzing all of the data again, or maybe half of it — we would basically just increase the fraction of what we analyze. That is how I am currently doing it. And we are not just monitoring COG-UK data: we also have this collaboration with Estonia, and for them we really are analyzing all of the data, even during peak times. We are also monitoring Greek data, where we are currently lagging a bit behind, but that is relatively easy to catch up on now that there are not so many new samples anymore. And then we have to see what the winter season brings — there could be new variants, maybe that spawns new interest in sequencing, you never know. But let's hope not; let's hope everything stays calm.

I was also slowly wondering if we could initiate something similar for monkeypox. I actually haven't looked at what the current sample numbers are.

Yes — monkeypox seems to be slowly dying off. It takes a surprisingly long time, but case numbers are decreasing. We could still do this retrospectively and analyze all the monkeypox data, but doing it live is currently probably not worth the effort of setting it up. What we in the EU team are currently really interested in is avian influenza. We are planning to build something similar, at a much smaller scale because there are not so many samples, to monitor global avian influenza samples. The core part will again be these workflows, pretty much the consensus-building workflow and the variation-calling workflow. The only thing you have to add for avian influenza is an upstream workflow before the variation-calling one, where you first determine the subtype of the avian influenza sample. Essentially you need to build an assembled reference genome that picks the right influenza segments and merges them together: if your sample is, say, H3N1, then you want to make sure that the HA gene segment you use as the reference is H3 and that the NA gene segment is N1, merge those together, do variant calling against that hybrid genome, and then build a consensus genome from that. That will be the difference, we think. That is one of the things we would like to do. But luckily, many of the viruses with the highest pandemic potential, like Nipah virus, are rather similar to SARS-CoV-2 in terms of what kind of virus they are — they are these relatively small-genome RNA viruses — so those should in principle all be analyzable with basically the same workflow, without big changes, hopefully.
Very cool. Do you think the new deferred data will be helpful, maybe in reducing the Jenkins runtime? Because it sounds like the scripts currently upload, wait for upload completion, and then build the input files — you could directly build the input files and just use deferred data.

Yes, if that works well and doesn't cause all kinds of downstream failures — because maybe one of these datasets doesn't become available at all, or whatever — but this needs to be explored. If it works well, it would greatly simplify things, I guess, because all that monitoring of when something is complete could be completely eliminated. Another nice improvement would be — I mean, we are lying a bit, I don't have that slide up anymore, but where we say we are doing these automated bot runs with the workflows from WorkflowHub or Dockstore: basically we are downloading those releases once onto the servers and then referencing them by ID on the server. A cleaner approach, and also an area where we could improve things, would be to really do a fresh download every time via the TRS ID.

How I would see the ideal situation is that you don't reference the workflow by server ID at all; you just give the TRS ID, and then the server fetches it, or runs it if it already has it.

Exactly. Currently, if you instruct Planemo to take a workflow from elsewhere, it uploads it every time into your user account, so if you run that workflow 500 times you end up with 500 copies of the same workflow. So this is an area where we could improve.

I think ideally `planemo run` could accept TRS IDs directly — maybe validate that it is a valid ID, but that's it, then run it.

Yes, and the caching thing: if it downloads the workflow, it should either download it not into your workflow list but into some kind of staging area, or keep track of what it already has, so that it knows, okay, I downloaded this two minutes ago and I am not going to do it again.

With deferred data, the inputs go directly to the compute node, right? So if you start many different workflows from the same input file, it is not a good idea — but I think we don't really do that.

On the flip side, you are not keeping the file unless your first step makes a copy of the inputs.

True, but then that kind of negates the whole advantage. We can discuss this some more, maybe next week at the European Galaxy Days — that could be a good chance.

Something else that came up while we were working on the initial stages of this: it would be pretty cool if you programmatically knew when a workflow invocation is done. The best we have currently is an email, and I guess you could automate that for higher throughput, but I'm wondering if you see it the same way or if this is just a developer's dream. I'm thinking that if we could kick off a webhook or something when a workflow has finished, that would facilitate your analysis.

Yes, that would be very nice. It would also help if there were an easy and reliable way to query the state of a workflow run — some simple Planemo interface for that, instead of using BioBlend to go through the history and see what has run and what state it is in; some summary that you could obtain easily, where you query the workflow and it answers something like: okay, 30 percent done, so-and-so many datasets still to go.
Do you mean the jobs per invocation?

Yes, something like that — that could also be cool. And there is actually a bit of a weird thing about Planemo, at least on usegalaxy.eu, that I still need to explore. One way to know when a workflow is completely done is that the `planemo run` call returns on the command line, right? So if you just write a linear script, you know it's done because Planemo returns. But for some reason, at the moment, workflows are often long completed on usegalaxy.eu and Planemo just doesn't return, and it is unclear what it is still waiting for when everything is finished. I don't know if this is a server issue, a Planemo issue, or something else, so that is currently something I'm exploring. But this whole area of keeping track of how far along a workflow is could use some improvement.

I guess some emails when it fails would also be good.

Yes. How are you dealing with intermediate data? I mean, not every output dataset of the workflow is valuable, right?

True. Currently we still keep it all, but the idea is that at some point we will write another script that goes through all the histories, which we keep track of in the JSON file, define which are the valuable outputs that we want to keep longer-term, and remove the rest via the API. That's the plan, but so far there was always something more urgent than cleaning up, so we still have it all in our histories.

I also still have 10 terabytes of my postdoc data on the system.

Let's publish this. In combination with the continuous monitoring and this idea that Sergei had about reanalyzing data from some high-profile papers, I think we should be good to go, and I think you already started processing that.

Yes — publishing, of course, is that other thing where there is always something more urgent to do than finally writing up the manuscript. There are always bugs to fix, new samples coming in, some hype about Omicron or whatever, so these have been two very exhausting years.

I know — and the standard problem will be a new variant creating new urgency. If you can, put the link to the slides somewhere.

Yes, of course. Should I just put them right here in the chat? Wait, that's not the one for sharing — let me first share it with everyone.

We'll also post the slides to the agenda for future reference. Well, thank you so much for the presentation.

You're very welcome.

If there aren't any other questions, we will see you all in two weeks, on October 13th, for the next community call. Thanks, everybody, for joining.

Thank you. Thank you.