Hello, and welcome everybody. My name is Niko Beerenwinkel. I'm a professor of computational biology at ETH Zurich and a group leader at the SIB, the Swiss Institute of Bioinformatics. I will talk about mining viral genomes to improve clinical diagnostics, and I will present a new SIB resource called V-pipe, a bioinformatics pipeline for analyzing viral sequencing data. I would like to start by motivating why we are interested in analyzing viral genomes. It turns out that RNA viruses in particular display a huge amount of genetic diversity. Within each individual patient, there is a whole set of viruses that are genetically related but different from each other, and if you compare virus populations between patients, they display even more heterogeneity on the genomic level. This diversity is a result of the evolutionary dynamics of viruses: RNA viruses have very high mutation rates, they tend to have short generation times, and they typically exist in very large population sizes. Together, these parameters generate a huge amount of genomic diversity on relatively short time scales. So there is a lot of diversity both between different infected hosts, called inter-host diversity, and within each infected individual, called intra-host genomic diversity. Here we will be concerned with the diversity that exists within a single individual, that is, the intra-host genomic diversity of viruses. This diversity is also relevant from a medical point of view; for example, low-frequency viral variants can be involved in the development of drug resistance. 
Here is a schematic time course for a single patient infected with a virus: the composition of the different variants changes over time, and a resistant variant can come to dominate the population by the time of treatment failure. The traditional approach to analyzing the virus population was based on Sanger sequencing, but Sanger sequencing only provides the average of the virus population, that is, the consensus sequence of the viruses. It is not able to resolve variants of lower frequency. This is in contrast to next-generation sequencing, or NGS for short, which can be performed at very deep coverage and then allows for quantifying viral variants that pre-existed in the virus population, also at lower frequencies, so it is a way to increase the sensitivity of viral diagnostics. Using NGS for viral sequencing and viral diagnostics comes with a number of opportunities, but also challenges. On the one hand, NGS technologies allow for assessing the genetic diversity of intra-host virus populations; on the other hand, this task is complicated by two facts: first, the sequencing reads tend to be much shorter than the genomic interval we are interested in, and second, these short reads are typically error-prone. Both the amplification process and the sequencing step itself introduce errors, and these technical errors need to be separated from the true biological variation in the sample that we are interested in. 
We have designed a new resource at the SIB called V-pipe. V-pipe is a pipeline where we start with a next-generation sequencing sample obtained from the virus population of one individual patient; a number of computational steps, which I will explain in the remainder of this presentation, then yield a reconstruction of the virus population that was originally present in the patient. This is an estimate of the virus population, which can entail the complete composition of viral genomes and their frequencies in the host organism. In this seminar you can learn how to run the pipeline called V-pipe from the command line, how to use V-pipe in applications and in benchmarking novel pipelines, and how to download and install V-pipe. So what is V-pipe? As I said, it's a pipeline, so we start with input data. The input data comprises the reads from the sequencing experiment in FASTQ format and typically also a reference genome, which is given in FASTA format. The first step in the pipeline is quality control, where various measures are taken to ensure the quality of the read data. Then there is a read mapping or read alignment step, which typically entails a complete multiple sequence alignment of all reads against the reference genome. In a way, the core of the pipeline is identifying and quantifying genetic diversity. This can be done on different spatial scales: on the level of individual positions of the genome, called SNVs, single-nucleotide variants, or on the level of longer stretches of the genome, in which case we refer to haplotype sequences, reconstructing haplotypes either locally or even globally over the entire region of interest. 
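As a reminder of the two input formats just mentioned, here is a toy FASTQ read file and a toy FASTA reference; the file names and sequences are invented purely for illustration:

```shell
# Minimal FASTQ: each record is four lines
# (identifier, sequence, separator, per-base qualities).
cat > reads.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTTCGT
+
IIIIIIII
EOF

# Minimal FASTA reference: a header line followed by the sequence.
cat > reference.fasta <<'EOF'
>reference_genome
ACGTACGTACGT
EOF

# Extract just the sequence lines (the 2nd line of each 4-line record):
awk 'NR % 4 == 2' reads.fastq
```

The four-line FASTQ record carries a quality score per base, which is exactly the information downstream quality control uses to trim or filter reads before alignment.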
The output of the pipeline then typically comprises a consensus sequence and all the variants, or SNVs, that have been detected relative to this consensus sequence, and it can entail a complete set of haplotypes reconstructed by assembling the short-read data into the individual haplotypes, plus an estimate of their frequencies. V-pipe is a command-line tool based on Snakemake, so in order to run V-pipe you call Snakemake with a parameter that specifies the pipeline itself, and there can be additional parameters. In this particular command-line call, there is Snakemake, the workflow management system that we use to organize the pipeline. The pipeline itself is specified in the Snakefile, which contains the set of all programs that need to be executed in order to run through the entire pipeline, as well as the order, or partial order, in which these steps need to be executed. In this particular call we also invoke the --use-conda flag, which enables the Conda environment, a cross-platform package manager that makes it easy to install the components of the pipeline that are necessary. The input data, in particular the short-read samples, are expected in a particular hierarchical structure, which allows organizing samples by patient, then by time point per patient, and possibly by sample per time point and patient. In this example, all data belonging to patient one would be in one particular branch of this hierarchy. The core of the pipeline is reconstructing genetic diversity, and I briefly want to mention two key ideas. The first one is local haplotype reconstruction. Here we have the reference sequence shown on top, and all the aligned reads are shown schematically. They come from three different variants in this particular example, and they contain a number of sequencing errors. 
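The hierarchical sample layout described above can be sketched as follows. The patient and date names are made up, and the exact folder conventions (e.g. a raw_data subfolder per time point, the Snakefile name) are assumptions to be checked against the V-pipe documentation:

```shell
# Illustrative two-level hierarchy: samples/<patient>/<date>/raw_data
# would hold the FASTQ files for one sample (names are invented).
mkdir -p samples/patient1/20190101/raw_data
mkdir -p samples/patient1/20190201/raw_data
mkdir -p samples/patient2/20190101/raw_data

# A typical invocation might then look like this (commented out here,
# since it assumes V-pipe and Snakemake are installed):
# snakemake -s vpipe.snake --use-conda

# Show the resulting layout:
find samples -type d | sort
```

With this convention, all data belonging to patient one sits under samples/patient1/, with one dated subfolder per time point.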
Now local haplotype reconstruction refers to defining a window over this multiple read alignment and trying to identify the different variants, called the local haplotypes, within this window. This boils down essentially to a clustering problem: when you are able to solve it, the clusters correspond to the predicted haplotypes, and their consensus sequences, the cluster centers, correspond to the predicted sequences of the local haplotypes. The second key element of the pipeline is global haplotype reconstruction. This refers to putting together the local pieces, given in terms of the reads, into longer stretches: haplotype sequences that cover the entire genomic region of interest. It also includes an estimate of the relative frequencies of these haplotypes, and there are different algorithms that address this problem; they can be statistical in nature or more combinatorial in nature. This can also be seen as a multi-assembly problem of reconstructing the whole viral quasispecies. Can I also build my own pipeline? Yes, V-pipe is very supportive of this. In fact, a number of tools have been developed in the research community, some just for SNV calling, some for local reconstruction, some for global reconstruction. Some of them use paired-end data, some work on single-end data. So there is a range of tools depending on the precise application scenario, and they can all be combined in different configurations of the pipeline. In order to support this, we offer a benchmarking platform. This platform contains a module for simulating data: it assumes a dependency structure among haplotypes and then generates, or simulates, reads from this assumed population structure. In the second step, the pipeline itself is run on these simulated data. It can be the standard pipeline as we provide it, or any variation of it that the user specifies. 
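The idea of local haplotype reconstruction as clustering can be illustrated with a deliberately crude toy example: reads falling into one window are grouped by exact sequence, frequent groups suggest true local haplotypes, and singletons are more likely to carry sequencing errors. Real methods use probabilistic clustering that tolerates errors within a cluster; the data here are invented:

```shell
# Six toy reads covering the same window: two true local haplotypes
# (ACGTACGT and TCGTACGT) plus one read carrying a sequencing error.
cat > window_reads.txt <<'EOF'
ACGTACGT
ACGTACGT
ACGTACGT
TCGTACGT
TCGTACGT
ACGAACGT
EOF

# Exact-match "clustering": count identical sequences, most frequent first.
# Counts here: 3x ACGTACGT, 2x TCGTACGT, 1x ACGAACGT (likely an error).
sort window_reads.txt | uniq -c | sort -rn
```

The cluster sizes also give a first, rough estimate of the relative haplotype frequencies within the window, which is the quantity that global reconstruction later stitches together across windows.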
And in the last step, there is also support for assessing the performance of the pipeline, essentially by comparing the simulated data to the estimates of genetic diversity that have been derived using the pipeline. How can I get started using V-pipe? That's very easy. V-pipe is freely available for download from our GitHub repository, and here's the URL where you can find the software. The requirements are Conda, as I mentioned before, Python, and Snakemake, and everything else can be handled by V-pipe itself by invoking --use-conda and other switches. There is also further information and documentation on the website, including wiki pages, where everything is documented in detail. V-pipe is intended to be a community-driven tool, and there are a number of activities. We give tutorials on the pipeline on a regular basis; the next one will take place as part of the BC2 conference on September 9, 2019 in Basel and covers pipelines for the analysis of viral NGS data. There is a mailing list which you can subscribe to that supports both users and developers of V-pipe. And finally, there is a contact email: v-pipe@bsse.ethz.ch. To summarize, V-pipe provides a robust, flexible, and extensible computational framework for the reproducible analysis of viral NGS data. By doing so, V-pipe supports clinical diagnostics as well as research on viruses. We want to provide a framework that allows for reproducible and also traceable results in a clinical context. The pipeline is actively maintained, it is open-source, and it is community-driven. Finally, I would like to thank the V-pipe team: specifically Susana Posada-Céspedes, who has developed a number of the tools and who put together the pipeline in the first place; Ivan Topolsky, our software engineer, who maintains the pipeline on the technical level; and Kim Philipp Jablonski, who is working on new methods for global haplotype reconstruction. 
Finally, I gratefully acknowledge the funding we have received to develop several components of the pipeline, and I also want to mention a number of collaborators and former group members who have contributed to V-pipe. Thank you for listening.