I have the pleasure of introducing Ivan Topolsky for this webinar. Ivan received a federal diploma in medicine from the University of Geneva in 2005, then a master's in proteomics and bioinformatics from the University of Geneva in 2009, and has been working at SIB since 2013. In 2017, he joined the computational biology group of ETH Zurich in Basel, headed by Professor Niko Beerenwinkel. His group is also a member of SIB. Ivan is currently working in the team developing V-Pipe, a bioinformatics pipeline for viral sequencing data. So today Ivan will show us how to apply V-Pipe to SARS-coronavirus-2 data. Ivan, the mic is yours.

Thank you, Guigua, and welcome to this webinar about V-Pipe. The purpose of our research is to study RNA viruses, which have some interesting properties: they have very high mutation rates, they reproduce very quickly, and they exist in very large population sizes. As a result, these viruses exist as mixtures, both across a population and within a single patient. These mixtures can have clinical impact. For example, a new mutant can emerge which has clinically relevant properties. In the case of HIV, for example, that could be resistance to some antiviral drugs, or, in the current discussions about SARS-CoV-2, it could be a mutation in one of the surface proteins. Over time, if it has an advantage, this viral subpopulation can take over the population of viruses present in a patient, at which point we can, for example, observe a failure of the treatment. Historically, sequencing has been done with Sanger, which gives us only an overview of the whole population, the consensus. It would be clinically interesting to be able to detect a variant earlier, already when it is just emerging. This is the type of possibility now offered by new sequencing techniques, such as next-generation sequencing on devices such as, for example, those from Illumina.
These techniques, on the other hand, come with some disadvantages: the reads are short, and sequencing errors can happen. That's why, in 2017, a new software pipeline was started at SIB: V-Pipe. The purpose of V-Pipe is to take as input the raw data as they come out of the sequencing machine, together with a reference for the organism you want to study. Then it will automatically perform quality control, it will align the reads, and it will look for the variants. At the end, you obtain several useful files. You can get the consensus sequence for each particular sample. You can also get the variations, such as SNVs, with the pipeline attempting to remove the sequencing errors. You can also get information about the various haplotypes present inside a sample, either at the local level, inside windows, or at the global level over the whole genome of the studied virus. The code of this pipeline is publicly available. Recently, we have introduced a new installer that makes installation much simpler and quicker. The pipeline relies on Snakemake, which brings a few interesting properties. Among others, we have paid attention that all the components of our pipeline, both those developed inside our group and all the third-party components on which we rely, are available as bioconda packages. Thus, Snakemake can automatically download and deploy them as needed for the execution of the pipeline. Here is an example of one of the components we have developed for this pipeline, which is relevant both for other viruses, such as HIV, and for the one that brings us here today, SARS-CoV-2. It is ShoRAH, a tool designed to help distinguish variants. It can work both at the level of local haplotypes and at the level of SNVs. It operates on a window, gathering the reads which fall within that window.
Then it attempts to cluster the reads by similarity. Within each cluster, it takes the consensus. This averages out read errors, which are spread randomly over the reads, whereas reads clustered together all belong to the same haplotype, from which we can then call SNVs with high confidence, knowing that we have tried to eliminate the errors. So if you want to use V-Pipe, the wiki associated with the GitHub page explains how to do it. And there is also a tutorial, written with SARS-CoV-2 in mind, showing how to apply V-Pipe to your SARS-CoV-2 data. This tutorial also demonstrates how to use the installer, which was developed for quick deployment. It is extremely simple to use: there are only two commands, one to fetch the installer, another to run it. When running the installer, you can decide where you want the software to be installed, you can have the installer prepare a working directory where you can start analyzing your data, and you can specify which version of V-Pipe you want. There are multiple versions of V-Pipe available, and you can select which branch you want. For example, today I am going to speak about the SARS-CoV-2 branch, which is the branch we have specifically set up for SARS-CoV-2 analysis. There is also the master branch, which contains as examples defaults set up for HIV. And there is another branch where the current development happens, but beware, there might be bugs there, so do not rely on that one without contacting us first. It is also possible to download stable releases as archived packages. For example, the recent publication about V-Pipe, which was submitted as a preprint, was performed with a specific release that you can download if you want to replicate it.
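The windowed clustering idea described above can be sketched in a few lines. This is a toy illustration only: real ShoRAH uses a model-based (Dirichlet process mixture) clustering, not this greedy Hamming-distance heuristic, and the read sequences here are made up.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(reads, max_dist=1):
    """Greedy clustering: a read joins the first cluster whose representative
    is within max_dist mismatches, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of reads
    for r in reads:
        for c in clusters:
            if hamming(r, c[0]) <= max_dist:
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters

def consensus(cluster):
    """Column-wise majority vote: isolated sequencing errors are voted out,
    while the shared haplotype sequence is kept."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

# Two hypothetical haplotypes in one window, some reads carrying one error each.
reads = ["AAAAACGT"] * 4 + ["AAAAACGA", "AACAACGT"] \
      + ["TTTTACGT"] * 4 + ["TTTTACGG", "TATTACGT"]
haplotypes = sorted(consensus(c) for c in cluster_reads(reads))
```

Running this recovers the two error-free haplotypes, because the single errors are outvoted within each cluster while the true differences between haplotypes define the clusters themselves.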
And for the people who have been following us from the beginning of the SARS-CoV-2 epidemic, this is the snapshot that we released earlier with the first feature set. This week, there will probably be a new version available which contains the visualizations I am going to demonstrate. To analyze your data, you need to organize it in a specific directory hierarchy with two levels. The samples directory contains a first level, which usually could be patients or samples, depending on your workflow. Inside, you have a second level: for example, in a cohort, the time point at which this analysis happened, or in a large sequencing project, the batch in which this sample was sequenced. Inside that, you put your data in a folder called raw_data. In case you are using paired ends, it is very important to keep the R1 and R2 names at the end of the file names, so V-Pipe knows which paired reads match together. In addition, you can have a sample table listing all the samples. If you do not provide one, it will be created automatically, but you can also edit it manually in order to add options such as the read length. This table has up to three columns: the first two columns correspond to the two levels of the sample organization, and an optional third column contains the read length of this particular sample. V-Pipe is extremely configurable. That is why the various versions I presented before come with sane defaults. For example, the SARS-CoV-2 version will use the same reference as used by other labs, and we provide some annotations and, as an example, the primers used by the ARTIC protocol, which is used in our setting. Another example of configuration: by default for SARS-CoV-2, we decided to use the BWA aligner, given the lower diversity of this virus. As I mentioned, many of the components can be swapped around; the aligner currently used for SARS-CoV-2 is BWA.
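The two-level layout and the sample table described above can be sketched like this. The sample and batch names are hypothetical placeholders; only the structure (sample, then batch, then a raw_data folder with matching R1/R2 file names, plus an optional tab-separated table whose third column is the read length) follows the description.

```python
import os

# Hypothetical samples: (sample name, batch/date) -> paired-end FASTQ names.
# Keeping "R1"/"R2" in the file names lets V-Pipe match the read pairs.
layout = {
    ("patient_1", "20200406"): ("patient_1_R1.fastq.gz", "patient_1_R2.fastq.gz"),
    ("patient_2", "20200413"): ("patient_2_R1.fastq.gz", "patient_2_R2.fastq.gz"),
}

for (sample, batch), files in layout.items():
    raw = os.path.join("samples", sample, batch, "raw_data")
    os.makedirs(raw, exist_ok=True)
    for f in files:
        open(os.path.join(raw, f), "w").close()  # placeholder for real FASTQ data

# Optional sample table: first two columns mirror the directory hierarchy,
# the optional third column is the read length of that sample.
with open("samples.tsv", "w") as tsv:
    for (sample, batch) in layout:
        tsv.write(f"{sample}\t{batch}\t250\n")
```

If you omit the table, the pipeline generates one automatically from the directory tree, as mentioned above.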
But, for example, we have developed an aligner which uses hidden Markov models for aligning NGS data and performs much better on highly variable regions; this has been demonstrated in use against HIV. And in addition to ShoRAH, we can also use the rather common caller LoFreq. It is also possible to use different engines for global haplotype reconstruction. Currently we are not doing it for SARS-CoV-2, but we have SAVAGE available as a default, and we can also use another software produced in our group, HaploClique. To run V-Pipe, you only need two commands. You can first ask Snakemake to list all the steps which are going to be performed, and if you are happy with them, you can pass the use-conda option to ask it to download all the packages, and specify how many cores you want it to run on. After execution, it is going to produce several output files. It will, for example, produce alignment files in a newly created directory, alignments. And it will produce consensus sequences both at the level of each sample, each sample having its own consensus built inside the references directory, and across the whole cohort in the root variants directory of the working directory. It can also give you information about the variants which were found. These are output both as standard VCF files, which you can then use in whatever is your favorite downstream analysis tool or VCF visualizer, and as visual reports that you can use to inspect these variants. I am going to show them now, as they are interactive. This is the type of report you can see at the end of the processing. It lists all the SNVs both in tabular form and in graphical form along the viral genome. You can look at each SNV, and you can also see how it falls within the annotations which were provided.
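Since the SNV calls come out as standard VCF, any downstream tool can consume them. Here is a minimal sketch of reading such a file; the two variant lines are illustrative made-up calls (an AF tag in INFO is assumed), and in a real pipeline you would use a dedicated library such as pysam rather than parsing by hand.

```python
# Illustrative VCF content: SARS-CoV-2 reference NC_045512.2, two fake SNVs
# with an assumed AF (allele frequency) key in the INFO column.
vcf_text = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
NC_045512.2\t241\t.\tC\tT\t.\tPASS\tAF=0.97
NC_045512.2\t3037\t.\tC\tT\t.\tPASS\tAF=0.42
"""

def read_snvs(lines):
    """Parse the fixed VCF columns into (chrom, pos, ref, alt, freq) tuples."""
    snvs = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blanks
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _, ref, alt = fields[:5]
        info = dict(kv.split("=") for kv in fields[7].split(";"))
        snvs.append((chrom, int(pos), ref, alt, float(info["AF"])))
    return snvs

snvs = read_snvs(vcf_text.splitlines())
# Example downstream use: keep only variants above 50% frequency.
major = [s for s in snvs if s[4] > 0.5]
```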
And here, as an example, we also look at the primers which were used in the sequencing of this data, so you can see whether there is some impact on the coverage, which is shown here in a lower shade. If you want, you can even include lower-quality SNVs. It is also possible to run Snakemake on a cluster and therefore use V-Pipe on your cluster. This, of course, is a slightly more advanced use and will deeply depend on how your local cluster is set up. As an example, here at ETH, it is possible to run the whole V-Pipe master job, that is, the dispatching Snakemake process, inside a cluster job, where we just use the standard options to submit a job. And it is also possible to have specific parameters for each job in order to dispatch them efficiently on the cluster; this is done using the cluster option of Snakemake. Of course, you need some tuning on a cluster, and most of the tasks inside our Snakemake workflow have corresponding configuration options where, for example, you can set the number of threads or the amount of memory used by a task. We are lucky to have an extremely large cluster at ETH, so tasks which are embarrassingly parallel, such as the window-wise analysis performed by ShoRAH, can benefit from lots of cores, but can also eat a lot of memory in these situations. Do not hesitate to contact us if you need tips about adapting V-Pipe to your local cluster. And here are some examples of how V-Pipe is used in the real world. We are currently performing sequence analysis as part of the Swiss sequencing project. On our side, samples that tested positive are gathered by the Viollier laboratory in Basel and come from all over Switzerland. Those are brought to the genomics facility in Basel, where they are sequenced on an Illumina MiSeq machine, and the data is made available on the openBIS platform.
V-Pipe will automatically gather the data from there, perform the analysis, and also perform some quality checks to make sure that the data is as it should be. Then part of this data, such as, for example, the consensus sequences, is uploaded to GISAID, enabling downstream analysis by Nextstrain and the computational evolution group. To show this in detail: this is the kind of data that we obtain from the openBIS repository. We have scripts using the official API which automatically arrange it in the kind of structure that we need: samples, then the batches in which these samples were each processed, and inside, the two files kept together with their R1 and R2 names. After running, we obtain, as mentioned before, the cohort consensus, and for each sample we obtain multiple outputs, such as the alignments, the consensus that we upload to GISAID, visualizations, and other information about variants. This is the type of interface we have on the openBIS side, with the data sets; I showed before how they are downloaded and rearranged into the V-Pipe format. Then we generate this type of report, where we have a summary of which samples passed the tests and which failed them. We perform whole batches of tests on each sample in an automated manner. We can, for example, check the coverage, the quality, the number of reads rejected due to bad-quality filters, or the number of reads kept for alignment, and so on. This table gives you an overview of all the checks which were performed, but it is also possible to display the information for each test. For example, it is possible to display the coverage per sample, to check those which pass and reject those which do not, for example the negative controls here, or to get an overview of the coverage across the genome of the virus. This is then uploaded to GISAID, which enables this kind of phylogenetic analysis by the teams of Nextstrain and the computational evolution group.
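The automated pass/fail reporting described above can be sketched as follows. The check names, thresholds, and sample values here are all made up for illustration; the actual checks and cutoffs used in the production pipeline differ.

```python
from statistics import median

def qc(sample):
    """Run a batch of hypothetical checks on one sample's metrics:
    enough reads must survive quality filtering, and the median
    genome coverage must be high enough."""
    checks = {
        "reads_kept": sample["reads_kept"] / sample["reads_total"] >= 0.8,
        "median_coverage": median(sample["coverage"]) >= 100,
    }
    return all(checks.values()), checks

# Fabricated metrics: a good sample and a negative control.
samples = {
    "patient_1": {"reads_total": 10000, "reads_kept": 9500,
                  "coverage": [850, 900, 910, 880]},
    "neg_control": {"reads_total": 10000, "reads_kept": 300,
                    "coverage": [2, 1, 0, 3]},
}

# Summary table: sample -> overall pass/fail, as in the batch reports.
report = {name: qc(s)[0] for name, s in samples.items()}
```

The per-check dictionary returned by `qc` is what lets a report show not just the overall verdict but which individual test a sample failed, as in the table described above.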
And recently they have released a narrative explaining how to read this type of data, and it is now available from the Swiss National COVID-19 Task Force as one of the tools to evaluate the evolution of the virus. This is currently a work in progress. We are going to increase the level of automation, to be able to process even more samples and to process them as new positives come in. And the quality checks that you saw are going to be merged into the stable branch of V-Pipe and made available in the coming days. This is not the only target we want to provide this data to: we are also going to work with the laboratory of systems and synthetic immunology of Sai Reddy at some point in the future. Another project, much simpler because it involves only a single lab, but nonetheless also large scale because we are still handling a lot of samples, is a survey of the data which is publicly available. At the beginning of the pandemic, we started to look at what data was available online. Even back then, quite a few datasets were available, and a sufficient part of them were done on Illumina-type NGS sequencing, though they were mostly from only two countries, Australia and the USA. And although a lot of them were metagenomic, also containing human DNA and other external DNA, quite a few were SARS-CoV-2 only. They are available both as paired-end and single-end NGS, although the intersection of the sequences that interested us turned out to be mostly paired-end NGS, by coincidence. V-Pipe has been used to perform quality checks, some filtering, alignments, and also to reconstruct the local haplotypes in windows and call the SNVs. This is an example of the coverage per sample which can be obtained at the end of the alignment step, and which helps us, for example, to select the samples with an interesting range of coverage: we want to avoid coverage that is too low, to avoid calling mutations with too much uncertainty.
This is just a screenshot, because the work is currently in progress, of the kind of analyses which have been performed, showing both mutation hotspots and the level of entropy across the genome, depending on the samples. As I mentioned, this is work in progress, because the availability of sequencing data in the public repositories has much increased: now we have many more countries and a lot more reads. So there is currently a new run being executed, and hopefully in the near future we will have the final survey with some publishable data. So do not hesitate to contact us. V-Pipe has a webpage where you can find information such as, for example, a mailing list where we announce upcoming new versions. There are also contacts if you want to write an email to the developers. We also have tutorials and much more on this website, so do not hesitate to go there and contact us. Finally, I want to say thank you both to all the current team, who have been contributing to all the work I have shown here, and to all the external people we have collaborated with. And thank you very much for listening to this presentation.