Great, so we really look forward to your talk about this pipeline. It's a pipeline that was released just two months after work on it started, right? Yes, it was quite a mammoth sprint to get this out, and I'll talk about that now. Perfect, we look forward to it.

So work on this pipeline really started towards the end of March, when COVID was becoming a bit of a pest in the UK and around Europe. People on nf-core decided that they wanted to help by contributing pipelines to analyse COVID data, and if I may say so myself, I don't think there's a better place to put these pipelines: you can imagine people trying to run them all across the world, and they'd like to run them in a reproducible, standardised manner. So we thought we would attempt to contribute to the effort for analysing COVID data.

We started off with a bit of haste, creating a couple of pipelines which we have since deleted. The main reason was that we didn't really have any domain experts or domain knowledge, and it's very difficult to develop pipelines like this without people who know what they're doing. Most nf-core pipelines, if not all of them, already have this; otherwise the pipeline just doesn't work. Luckily, Michael Hewitt put us in touch with Sarai, Miguel and Sara from BU-ISCIII in Spain, who I believe are part of a surveillance unit looking at outbreaks and other things in Spain. They had already written a couple of Nextflow pipelines to do this: one for assembly, the other for variant calling. So we decided to port their pipelines. After a very quick chat they were up for it and said, yes, let's do this, and we got together and started developing this pipeline on the 30th of March.

To summarise, the pipeline was developed to do de novo assembly on viral sequences and also to call variants, or look for mutations, in the virus. It was developed specifically for Illumina sequencing data. This could be data coming directly from clinical samples, or it could be based on enrichment protocols like the ARTIC protocol.

After two months of madness, we finally released the first version. We merged both of their pipelines into a single pipeline, and we ended up with three variant-calling routes, using VarScan 2, iVar and BCFtools, and four different assembly algorithms: SPAdes, metaSPAdes, Unicycler and minia. This just gives the user more flexibility in terms of calling variants and doing assembly; as some of you may already know, you can get quite different results depending on which algorithm you choose, so we decided to give the user the options. To the right are the people who were involved in the first release, so thank you all for contributing, getting involved and helping out.

There are also a number of other things we added to the pipeline which are more generically applicable, like downloading data from SRA, ENA and GEO. As you may know, Nextflow has a fromSRA method that allows you to do this, but there are a couple of limitations we found that warranted writing a custom script. So this took a while.
Jose was involved in this and we wrote a custom script, which actually works really well now: we can do MD5 checks and other things within the pipeline while downloading the data. So the pipeline can be used directly with SRA, ENA and GEO IDs as well; you don't have to download the data beforehand. A variant graph generation step was also added, with help, I guess, from Eric and Michael. It's apparently quite a big thing at the moment.

On nf-core, what we're trying to do is not only test the pipelines on small data but also on large data, as Alex mentioned in the previous talk, and this pipeline was really the first one where we stress-tested our AWS GitHub Actions setup on real full-size data. Gisela and I did quite a lot of work to get things working for that, which was really cool. The pipeline was also written in a very flexible way. There are extensive options, so you can literally run a single process within the pipeline if you use all of the correct skip parameters. We tried to make it as flexible as possible.

Another thing that I think we're really proud of, or at least I am, is that we added a lot of custom metrics to the MultiQC report. Phil and Alex helped a lot with this before the last MultiQC release, adding modules we could use for this pipeline. If I take you to the report here in the browser, you can see it's quite an extensive MultiQC report, and we're able to add a set of metrics at the top that give you a good enough summary of the pipeline results. This was quite tricky. In the end, the way I decided to implement it was to run MultiQC once, let MultiQC do all the heavy lifting and parsing of the information we needed, then write a separate script that parsed the files MultiQC dumped, the YAML files, and run MultiQC again to inject this back into the report. It was really interesting and really cool, and it's something that can be used for other pipelines as well: you may not want a thousand metrics thrown at you, you may just want a select set of metrics that help you assess how good the quality of the data is, and that's what you can use this for. So, back to my presentation.

After the pipeline was released... The Crick, where I work, just to give you a bit of background, has been doing RT-PCR testing, mainly for healthcare workers from University College London Hospitals, or UCLH. They started doing this quite early on, when the government was still trying to figure out what time it was. This was led mainly by some amazing scientists we have at the Crick: Charlie Swanton, Sonia Gandhi, Steve Gamblin, and Jerome Nicod, who's the head of sequencing at the Crick. They decided they wanted to help with the effort and test healthcare workers, and they set up an entire testing pipeline within two weeks. Given that reagents weren't available at the time, they decided to make them in-house. They've been doing this for healthcare workers, they still have this pipeline running, and we're also now using it to do weekly staff testing in-house at the Crick, which is great. So far, we've got a thousand positive samples from approximately 50,000 tests that they've performed.
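As a rough illustration of the download-and-verify approach described above for SRA/ENA IDs, here is a minimal Python sketch against the public ENA portal API. It is not the pipeline's actual script, and the run accession used at the bottom is only an example:

```python
import csv
import hashlib
import io
import urllib.request

ENA_API = "https://www.ebi.ac.uk/ena/portal/api/filereport"  # public ENA portal API

def fetch_run_files(run_accession):
    """Query ENA for the FASTQ URLs and expected MD5s of a run accession (e.g. an SRR/ERR id)."""
    query = (f"{ENA_API}?accession={run_accession}"
             "&result=read_run&fields=fastq_ftp,fastq_md5&format=tsv")
    with urllib.request.urlopen(query) as resp:
        rows = list(csv.DictReader(io.StringIO(resp.read().decode()), delimiter="\t"))
    urls = rows[0]["fastq_ftp"].split(";")   # semicolon-separated FTP paths
    md5s = rows[0]["fastq_md5"].split(";")   # matching MD5 checksums
    return list(zip(urls, md5s))

def download_and_verify(url, expected_md5, dest):
    """Download one FASTQ file and check its MD5 against the value reported by ENA."""
    urllib.request.urlretrieve("ftp://" + url, dest)
    md5 = hashlib.md5()
    with open(dest, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            md5.update(chunk)
    if md5.hexdigest() != expected_md5:
        raise ValueError(f"MD5 mismatch for {dest}: {md5.hexdigest()} != {expected_md5}")

if __name__ == "__main__":
    for url, md5 in fetch_run_files("SRR11605097"):  # example accession, purely illustrative
        download_and_verify(url, md5, url.rsplit("/", 1)[-1])
```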
The idea at the moment is that the Crick would like to submit those thousand positive samples to COG-UK, which is a nationwide initiative to track the evolution of the virus and also to work out whether outbreaks are local or coming from abroad, for example. The way you do that is by looking at mutations in the virus. They've done an amazing job setting up that infrastructure as well; they've sequenced 35,000 viruses to date, though I think that might be an underestimate. So the idea was that the Crick would upload these samples to COG-UK, and the initial plan was to sequence a small set of samples on the Illumina MiSeq using the ARTIC amplicon protocol, which is quite a popular protocol for this.

For those that don't know what that is: this is a screenshot of IGV showing reads mapped to the SARS-CoV-2 genome, so this is real data, and in the second track at the bottom you can see staggered primers that cover the entire genome. The idea is that you can use these primers to enrich for the viral sequences, because a lot of the time, when you take swabs or any biological samples, the viral RNA is actually in quite low abundance. You need some way of enriching it, and the ARTIC primers are one way of doing that.

For transparency, once we had the data for those hundred samples, these are the parameters I used to run the pipeline. One quite neat thing here is the NXF_WORK variable I'm using. I'm not sure many people know about this or use it, but it's quite cool to be able to put the work directory in a separate scratch space if you have one available. It just makes things easier in terms of tidying data up and keeping things slimmer. Our working area is snapshotted, and we don't want the work directory to be snapshotted as well, so we put it in a separate scratch space because it's just intermediate files. So this was the command I used to run the pipeline, and this was the config I used for the genomes. It's just nice to be able to tuck all of these things away because they can be reused in exactly the same way. Here I'm specifying that I'd like to use this version of the reference, this version of the amplicon file, and a local Kraken database.

Now one interesting case, and I'm going to try and give you some biology here as well. We had a couple of samples from a healthcare worker who tested positive, then tested positive again a couple of months later, while testing negative in between. So there was interest in figuring out whether this healthcare worker had contracted a new virus, or a new strain of the virus, or had the same virus all along; you can imagine it's quite odd for the virus to be there, disappear and then come back again. So this was quite an interesting case. When we ran the pipeline, by default the pipeline has thresholds for calling variants: a lower threshold, which I believe is 0.25 or thereabouts, and a higher one of 0.75. What that essentially means is, if you look at any particular base in the genome of the virus, what proportion of the reads at that base carry the mutation you're looking at.
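As a back-of-the-envelope illustration of the allele-frequency thresholding just described, here is a minimal Python sketch. The function names are illustrative and the 0.25/0.75 cutoffs simply mirror the scheme described in the talk; this is not the pipeline's code:

```python
def allele_frequency(alt_depth, total_depth):
    """Fraction of reads at a position that support the alternative (mutant) base."""
    return alt_depth / total_depth if total_depth else 0.0

def classify_variant(alt_depth, total_depth, low=0.25, high=0.75):
    """Bin a called variant by allele frequency using a low/high threshold scheme."""
    af = allele_frequency(alt_depth, total_depth)
    if af >= high:
        return "high-confidence"   # consistent with a near-clonal single-stranded viral genome
    if af >= low:
        return "low-frequency"     # reported, but treated with caution
    return "below-threshold"       # not reported

# e.g. 190 of 200 reads carry the mutation -> AF = 0.95 -> high-confidence
print(classify_variant(190, 200))
```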
Because the virus has only a single strand of RNA, you can assume that any real mutation will show up with an allele frequency close to one. So we use the 0.75 threshold to say, okay, these SNPs and indels are a high-confidence set if they are present at that allele frequency. However, when we looked at the data, it turned out that while some SNPs, as you can see in that table, were indeed shared, there were a few that didn't agree across the two samples, and that obviously raised some questions.

If we look at one of the SNPs, you can see that in sample two you have virtually zero coverage in the region where the SNP was called. The red line at the top is the mutation in the virus, but if you look at the second track, there are no reads covering that location. It turned out this was because the sample had a high Ct value. The Ct value indicates how many PCR cycles you need before you detect the virus, I believe, and the higher it is, the lower quality the sample is. The second sample was essentially a lower-quality sample, so you can see genome-wide that the coverage from the amplicons isn't as good. If we zoom in, you can see the red bar, which indicates that that amplicon has also dropped out. These are all plots and figures we decided to add after the first release to improve the pipeline and help troubleshoot these things, and they're now generated by default by the pipeline. There's nothing you can really do about that: if you don't have coverage in a region, you can't call a SNP there.

We also added some other plots. This is a heatmap across all of the samples, and at the intersection of those two samples you can see a blue bar, which shows there's low coverage. Again, this is really helpful for looking at the data straight away, to see, for example, which amplicons have low coverage and have dropped out, and what the quality of the samples is like. For example, the four samples at the bottom, which are heavily blue, there's very little we'll be able to do with that data; those are low-quality samples, and there's not really much you can do about that.

There was one SNP, however, where there was a discrepancy: you can see the red bar in sample two at the bottom, but not in sample one. This would need further validation, but it could be down to the fact that sample two is lower quality, so something that looks like a real mutation has been introduced by things like PCR, or it could simply be that the sample was poor. This needs to be looked into more, but it's definitely interesting.

The insertion that was called in one sample and not the other actually turns out to be in both samples. You can see it here; this is, I think, a nine or six base pair insertion, and it has been observed in other samples around the world, so it appears that this indel is real. The reason it wasn't called in one of the samples is that, if you look at the reads, the portion you would describe as masked has been soft-clipped off, because the indel sits at the end of the read. So this is really more of an alignment issue.
When the indel is at the end of a read, there aren't enough flanking bases to map it correctly, so those bases get masked and the indel doesn't get counted. It therefore ends up with a lower allele frequency that doesn't pass the 0.75 threshold we use, and so it doesn't get called. This also highlights the importance of actually looking at your indels and SNPs in a genome browser to validate whether they're real or not.

After all of this work, in collaboration mainly with Jerome from the sequencing facility, Robert, Richard from the group, who helped a lot with all of the plots, and Angus, my boss, we managed to get another release of viralrecon out, adding all of these plots and also adding functionality to intersect the variants across multiple callers. That allows you to see how consistently a variant is called across all three callers, so VarScan 2, iVar and BCFtools.

I also added a neat little bit to the pipeline, for which I pinched a lot of the code from Phil's work on the RNA-seq pipeline. The idea is to ignore samples that have fewer than a certain number of reads mapped to the viral genome. Because we don't know the quality of the samples coming in, you can create a massive sample sheet, run the pipeline, and then the pipeline just fails: if you have poor-quality samples where you don't get any mapping, the pipeline will die at Bowtie 2 or downstream, when it tries to sort the BAM file. To avoid that, we added a minimum mapped reads parameter. It takes the Bowtie 2 flagstat output, looks at the number of mapped reads, and if that's less than a given threshold, it simply stops running anything downstream for that sample. This means the pipeline breaks less often, given that we expect poor-quality samples coming in.

As for future plans, this pipeline definitely needs a DSL2 implementation. There are a lot of processes that are reused all over the place, given that we're running three variant callers and four assemblers, and the downstream processes after those steps are pretty much the same, so it would be amazing to get a DSL2 implementation. Given that we needed to get this pipeline out as quickly as possible, we didn't really do any resource benchmarking either, and it would be nice to do that. We have obviously stress-tested the pipeline quite a lot on small, medium and large datasets and it does work, so it would be nice to refine this a little more.

We're also working with Anton and Dmitry from St. Petersburg State University, who are the main authors of SPAdes. They're now on our Slack and helping out, and they've written a new coronaSPAdes assembler which is specific to the outbreak. The reason is that they were finding the assemblies coming out of SPAdes were often broken or generated too many contigs, and they wanted to refine that within the context of coronaSPAdes. Recently, I think Anton and Sarai managed to get a single continuous contig from an assembly, which is really, really cool for this virus. So we'll work on that and on adding it to the pipeline.
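For illustration, the kind of flagstat-based gate described above could look roughly like this in Python. This is a minimal sketch, not the pipeline's implementation, and the 1,000-read default is only an example threshold:

```python
import re

def mapped_reads_from_flagstat(flagstat_path):
    """Pull the mapped-read count out of a `samtools flagstat` text report."""
    with open(flagstat_path) as fh:
        for line in fh:
            # e.g. "86422 + 0 mapped (99.12% : N/A)"
            match = re.match(r"(\d+) \+ (\d+) mapped \(", line)
            if match:
                return int(match.group(1)) + int(match.group(2))
    raise ValueError(f"No 'mapped' line found in {flagstat_path}")

def passes_min_mapped_reads(flagstat_path, min_mapped_reads=1000):
    """Decide whether a sample has enough reads mapped to the viral genome to be worth processing."""
    return mapped_reads_from_flagstat(flagstat_path) >= min_mapped_reads
```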
We've also had some discussions with Artem from the Serratus project, who are trying to mass-scan, I believe, the entire SRA database to look for known and unknown coronaviruses. They're interested in using the assembly steps in the pipeline to do this, so hopefully we'll do some work on that in the future.

And yes, thank you all for listening. I'd like to thank the nf-core community for coming together; this is really a good example of a lot of people just putting their hands up and getting involved in the pipeline because they decided it was important to do. The Nextflow community, as always, Paolo, Evan and everyone else there. The Crick COVID-19 Consortium, COG-UK, and the Crick ASF, so Jerome and Robert, who really helped delve into the data for us to get this second release out and add all of these lovely plots to the pipeline. My group as well, who are just amazing, and we're managing to get through this together, Richard especially, who helped with the plots, and Angus, who's working with me now on uploading all of this data to COG-UK. And last but not least, Gisela and Rike, who have done an amazing job with the organization of this hackathon. I'd just like to say thank you to them. Thank you.

And thank you very much, Harshil, for the explanation of this pipeline and also for your great job in leading this collaborative effort to develop it, which was certainly crucial for bringing it to a first release within just two months. We do have quite a few questions from the audience. A more technical question from Tan Le Viet: what Ct value cutoff did you use for including the SARS-CoV-2 samples for sequencing?

Good question. Jerome has actually done a lot of this work; it's amazing what you can do with Excel, I don't have the patience, but he clearly does. The pipeline generates lots of different metrics. As I showed you in that MultiQC report, you get those metrics in the report, but there's also a flat file you can import the data from, and he's generated loads of plots that correlate Ct with mapping, with variants called and so on. I think he's having discussions with UCL and possibly Imperial to figure out what cutoff to use. But off the top of my head, I would say 30, I think; somewhere between 20 and 30, anyway. Anything with a higher Ct value is pointless to sequence because it's just too poor quality, so it's a waste of money and time. Okay, perfect, thanks a lot, and if you'd like to know the exact value, I'm sure we can chat about it.

Another question, by Paul: can the viralrecon pipeline be used for discovering novel virus genomes? Yes, absolutely. The variant calling branch of the pipeline requires a genome input, so you need to know what virus you're looking at beforehand. However, the assembly steps can be used for that. At the moment they're more geared towards refining the assemblies for just the SARS-CoV-2 virus, but the assemblers can generate as many viral contigs as you want, essentially, and those contigs can contain other viruses; we just decided to focus the downstream steps on SARS-CoV-2. But there are ways you could use the contigs that come out of the assemblers to look for other or novel viruses, for example. I might just jump in and say we have a couple of other pipelines as well.
It depends a lot on the input data you have which pipeline is most suitable. But we also have bacass, which is specifically for assembling bacterial genomes, and we have mag, which is for metagenomics, when you don't know what's in your sample, and which does assembly and binning. I haven't read the paper you linked to, but I wonder if mag might be the most similar approach, just looking at the figure one schematic.

Okay, another question from the audience: are there any plans for incorporating Oxford Nanopore sequencing into viralrecon? So, yes, we had discussions right at the beginning about having a Nanopore pipeline to do pretty much the same thing, but they sort of fizzled out and we decided to focus our effort on the Illumina side, because Sarai and Sara had already developed their pipelines and they were ready to go. It would be nice to add a Nanopore pipeline. In fact, we've got an nf-core/artic repository, which is one of the pipelines I said we created in haste and which hasn't been worked on for a while now; that was created exactly to work with Nanopore data. But apparently a lot of things were changing at the time, so we didn't do anything with it. It would be nice to add that, though I think it would probably have to be a separate pipeline, or maybe an extension to nanoseq. If anyone wants to help with that, that would be amazing. Yeah, we have at least one other pipeline that can also handle Oxford Nanopore data, but right now I couldn't say whether it was nanoseq or bacass. It might be bacass as well, actually, because I think Alex added Nanopore support to that. Yes, it can work with Nanopore data. The variant calling and all of that type of stuff is a bit more customised, which may require a separate pipeline, or it could be added in a different context. Perfect, thank you very much.

I think Phil also had a question: how come only 1,000 samples were submitted to COG-UK when 36,000 samples were sequenced? Ah, sorry, those figures may have been confusing. The Crick has done over 50,000 PCR tests for healthcare workers from UCLH; a thousand of those tested positive, and those are the ones we would now follow up and sequence, providing their Ct values are good. The 36,000 is the total for COG-UK, which is a separate thing: COG-UK has its own initiative to sequence the virus, with various universities and the public health agencies in England coming together to do this. The reported figure is 36,000 sequenced so far, but I'm pretty sure it's more than that.

I thought I might quickly jump in and mention MultiQC, just because it's my pet project. Harshil has done loads of work with MultiQC, as he showed, for this pipeline. I really like working with people who are using MultiQC for a very specific reason, because it throws up all these weird edge cases and real-life use scenarios which I never anticipated, and this was a great example of that. Harshil explained how we did it in the end by running MultiQC twice, which works, and makes good use of the fact that MultiQC can give a standardized output for any tool in the same kind of YAML or JSON. But I still don't love that, and I'm hoping we can come up with a better solution in the future.
The most customisable way to do crazy stuff like this is to write a MultiQC plugin, where you can run arbitrary custom Python code within MultiQC, and that would work brilliantly here. The downside is that everyone would then need to install that plugin as part of the pipeline, which makes it slightly less portable. So off the back of this I've been thinking about new features for MultiQC: in the future I'm hoping to refactor MultiQC so that you can use it within a Python script and mess around with its execution in real time, without needing any custom plugin code. So stay tuned for that. But it's been really fun working on MultiQC and helping people with it. I was just really impressed with the way that it works; it just works. And thank you for adding all of those little things that I needed to get this report together at the time. It's all come together really nicely, I think. Happy to help where I can.
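As a rough sketch of the run-MultiQC-twice pattern discussed above: this assumes PyYAML is available, that the first MultiQC pass dumped its parsed data as YAML (e.g. via `--data-format yaml`), and that MultiQC's `_mqc.yaml` custom-content convention is used to inject the distilled summary on the second pass. The section id, titles and file names are illustrative, not those used by the pipeline:

```python
import yaml  # PyYAML

def load_metrics(yaml_path):
    """Load one of the per-module data files MultiQC dumped on the first pass."""
    with open(yaml_path) as fh:
        return yaml.safe_load(fh)

def write_summary_custom_content(per_sample_metrics, out_path="summary_variants_metrics_mqc.yaml"):
    """Write a small custom-content file; on the second MultiQC run, any file ending in
    `_mqc.yaml` is picked up and rendered as an extra table in the report."""
    section = {
        "id": "summary_metrics",                 # illustrative ids / titles
        "section_name": "Variant calling metrics",
        "plot_type": "table",
        "data": per_sample_metrics,              # {sample: {metric: value}}
    }
    with open(out_path, "w") as fh:
        yaml.safe_dump(section, fh, sort_keys=False)

# First pass:   multiqc . -n run1        (MultiQC parses all the raw tool outputs)
# This script:  distil the handful of metrics of interest from run1's data directory
# Second pass:  multiqc . -n final       (picks up summary_variants_metrics_mqc.yaml)
```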