I'd like to first thank the organizers for the opportunity to present recent and ongoing work from our lab on the integration of multiple data types for virus discovery. About 10 to 15% of cancers worldwide are caused by viruses, and there are several virus types which we know are associated with cancer. The role of others is controversial. Some viruses integrate into the host genome, whereas others remain episomal and do not integrate. One of the questions that we wish to address in this study is: what is the role of integration in gene expression? And also, how is integration associated with cancer initiation and progression? That's a broader question. Then, more broadly yet: even though many types of viruses are ubiquitous in the human population, cancer is not. So a large question is, why do some people get cancer and others do not?

The TCGA data provides a rich source with which we can start to address some of these questions. This is a graph of the samples which are available for various diseases with various data types. For this particular work I will be focusing on four different types of cancer. All of these are known to be associated with viruses to one degree or another, and we will be looking at bladder, cervical, head and neck, as well as stomach cancer.

So, just a brief overview of the particular pipeline that we have developed for this work. We start off with RNA-seq data, which is aligned to a human reference, and we extract the unmapped reads, which are then aligned using BLAST to a large virus database. This is the virus discovery part of the analysis, and it allows us to form a small select virus reference which includes just the viruses of interest.
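To make the virus discovery step a little more concrete, here is a minimal sketch of how a select virus reference could be chosen from BLAST hits on the unmapped reads. It is an illustration only, not the pipeline from the talk: the function name `select_viruses`, the `(read_id, virus_name)` input format, and the `min_reads` threshold are all assumptions introduced for this example.

```python
from collections import Counter


def select_viruses(blast_hits, min_reads=2):
    """Pick viruses supported by at least `min_reads` distinct reads.

    blast_hits: iterable of (read_id, virus_name) pairs, e.g. parsed from
    tabular BLAST output (hypothetical input format for this sketch).
    Only the first hit seen for each read is counted.
    """
    seen_reads = set()
    counts = Counter()
    for read_id, virus in blast_hits:
        if read_id in seen_reads:
            continue  # each read contributes to one virus only
        seen_reads.add(read_id)
        counts[virus] += 1
    # most_common() sorts by descending read count
    return [virus for virus, n in counts.most_common() if n >= min_reads]


# Toy example: read r2 hits two viruses, only its first hit is kept.
hits = [("r1", "HPV16"), ("r2", "HPV16"), ("r3", "EBV"),
        ("r2", "HPV33"), ("r4", "HPV16")]
selected = select_viruses(hits, min_reads=2)
print(selected)
```

In a sketch like this, the names returned would determine which virus sequences are combined with the human reference for the later realignment.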
We then start with whole genome data, which is initially aligned to the human reference, and we realign it to this select reference, which includes the virus and the human reference. Also, for the cervical work in particular, we use cervical sequences which are created with HPV probes and are aligned to the HPV reference to start with. We use these data to query viral integration.

So just to give a broad overview of the viral patterns that we observe in different cancer types: in bladder cancer only about 5% of the cancers are associated with virus, and these are the human papillomaviruses, the various herpesviruses, and also BK polyomavirus, whereas in cervical cancer nearly all cancers are associated with some form of HPV. For head and neck, about 85% have no virus, with HPV 16 and 33 being dominant among those that do, and then stomach cancer has about 10% EBV.

To analyze the integration we use a reference that has the human genome as well as the select virus. Typically, read pairs will align to locations near one another in the genome, whether it's human or virus. On occasion, though, we'll find a discordant read pair, where one of the reads maps to the human and one of the reads maps to the virus, and it's these discordant read pairs that we then analyze to look for evidence of integration. In particular, we take the genomic position of the human read as well as of the virus read, and we construct the coordinates of one point, which corresponds to one discordant read pair. We then look at a number of such read pairs, and clusters of such discordant read pairs then correspond to evidence of viral integration. These can be identified both visually as well as using tools such as Pindel.

So looking at viral integration, we find that for EBV we haven't found any integration events. For HPV 18, in all the cases we looked at, we did find integration.
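The discordant read pair analysis described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the method used in the talk and not how Pindel works: each discordant pair is reduced to a point (human position, virus position) on a single human chromosome, and points are grouped greedily by human position. The function name and the `max_gap` and `min_pairs` thresholds are hypothetical.

```python
def cluster_discordant_pairs(points, max_gap=500, min_pairs=3):
    """Group discordant read pairs into candidate integration clusters.

    points: list of (human_pos, virus_pos) tuples, one per discordant
    read pair, all on the same human chromosome (assumption).
    A new cluster starts whenever the gap in human position to the
    previous point exceeds `max_gap`; clusters with fewer than
    `min_pairs` points are discarded as likely noise.
    """
    clusters = []
    current = []
    for point in sorted(points):
        if current and point[0] - current[-1][0] > max_gap:
            if len(current) >= min_pairs:
                clusters.append(current)
            current = []
        current.append(point)
    if len(current) >= min_pairs:
        clusters.append(current)
    return clusters


# Toy data: two tight groups of pairs and one isolated pair.
points = [(1000, 50), (1100, 60), (1250, 80), (9000, 3000),
          (20000, 7000), (20100, 7050), (20300, 7100)]
clusters = cluster_discordant_pairs(points)
for cluster in clusters:
    # report human-coordinate span and supporting pair count
    print(cluster[0][0], cluster[-1][0], len(cluster))
```

Clustering on the human coordinate alone is the simplest choice; a fuller treatment would presumably also use the virus coordinate and read orientation.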
For the other HPV viruses, about half to two-thirds are integrated.

For the next few slides I will drill down into a specific sample and analyze it in further detail. This particular sample is a head and neck sample which has 118 discordant pairs, which map to three different clusters, and these clusters map to human chromosome 14. We can now map this onto the gene, and this cluster providing evidence of integration is on the RAD51B gene, which is a DNA repair gene which has been associated with virus integration in the past. Just to give you a sense of scale, the viral genome is roughly 8,000 base pairs, and what I'm showing here is a section of about 150 kilobase pairs in that dimension. This particular gene extends further out in both directions. There is one exon which is visible here at that location, and the integration takes place in the intronic section.

You might have to turn your head to see this right, but we can now look at the copy number along chromosome 14. What we find is that outside of the integration region the copy number is roughly two. Within this particular section of the integration the copy number is about 15, and then there is a smaller section here where the copy number is about four. We can draw lines to guide the eye from the discordant read pairs to the chromosome, and what we find is that the discordant read pairs map very precisely onto boundaries of regions of the copy number. We find this to be very common: we see copy number increases at integration sites in about two-thirds of the cases. For the remaining one-third of the cases we see, with about equal probability, copy number decreases or no copy number changes. We can also look at the copy number of the virus, and this scale here is a copy number of about 30.
One of the things that we observe is a deletion of a section of the virus. The circular HPV virus, as it integrates into the host cell DNA, very frequently loses a section, and this is the lost section that you see here. The copy number is a little noisier than what we observe for the human chromosome, but once again the discordant read pairs tend to map onto regions of differing copy number.

We can also construct a histogram of the copy number. In particular, I'll be looking at the blue histogram here, corresponding to the virus. What we find is that there are two distinct peaks in the copy number histogram, and this suggests that there are two distinct viral populations in this sample.

We can also look at the viral genes, located here on the horizontal axis, as well as the RNA expression of these genes. What we find, which is commonly seen, is that the E6 and E7 viral oncogenes are expressed at a high level, with the remaining genes being at a relatively low level. We can also look at the expression of the RAD51B gene. Over here on the left is a distribution of the RSEM expression level for a population of head and neck tumors, and here on the right for a population of head and neck normals. For the particular case that we're looking at here, the gene expression is very high.

We can actually look in more detail at the expression patterns. Here is now a plot of per-exon RNA expression for the 15 exons which compose the RAD51B gene. In red is the per-exon expression level of the particular sample we're looking at, and the teal color corresponds to the other head and neck tumor samples under consideration. What we find is that downstream of the integration site the expression level per exon is very high for this particular sample, whereas upstream it tends to be normal. So this provides
evidence that the viral integration upregulates the expression of the downstream exons.

To look at this in a little bit more detail, here is essentially the same data, but I'm plotting a normalized per-exon expression level. It's normalized so that a value of zero corresponds to an average expression level. I'm plotting these versus position, with the integration event drawn here in red, and this is for a million base pairs on either side of the integration site, and here are the various genes that are found there. What we find once again is that downstream of the viral integration event the exon expression is quite high.

The next question is whether this is a general pattern or something that we see just in this particular case. So here is a distribution of per-exon expression for samples which have a copy number increase at the integration site. We consider the expression upstream, for a hundred thousand base pairs, within the integration site, and also downstream of the integration. What we find is that upstream of the integration event the distribution is relatively broad, with a significant population both below and above zero, whereas within the integration event, where there is a copy number increase, the per-exon expression tends to be upregulated, and downstream it tends to be upregulated as well, generally speaking.

So this is for a copy number increase, and we can also consider expression for copy number decreases. What we find is that within the integration site the per-exon expression is downregulated; upstream it tends to be around zero; and there are two different populations, one around zero and one with significantly increased expression.

So to summarize, we've developed a pipeline
which allows the integration of multiple data types for the analysis of viral integration. We use RNA-seq for virus discovery as well as for expression analysis, and whole genome data as well as exon data for integration analysis. We've constructed a unified visual representation which allows us to combine visually, for one sample, these integration events. And finally, we've illustrated a close association between virus integration, copy number domains, and expression levels. With that I would like to thank my PI, Li Ding, as well as the other lab members and others who have contributed to this project, and I'd like to thank you. If there are any questions, I'd be happy to answer.

Have you looked at what distinguishes tumors which have integration versus those which don't have integrated virus, in terms of whether there's a different mutational spectrum or other alterations in the non-integrated tumors that might predispose them to survive without integration? The reason I ask is that there have been a number of reports over the years that tumors that are HPV positive have a portion of the tumor that becomes HPV negative and progresses, or cell lines that have HPV in vitro initially lose HPV and still maintain an immortal state. So do you have any insight into that?

The question being, are we observing differences between tumors where there is viral integration and where there is not? We've started doing various clinical correlations between these. I'd say that right now it's still in the early stage of the analysis, and we haven't observed any particular differences yet, but that is ongoing work.

Sure. Very nice talk, and really nice analysis of the details of viral integration and expression. I just want to point out a couple of papers in the literature for you. One is focused on analysis of TCGA data, and it's from Erik Larsson's lab, I think a former postdoc of Chris Sander's; that was in Nature Communications last year. Another is
from our group, from Akinyemi Ojesina and Chandra Pedamallu, on cervical cancer, in Nature at the end of last year. I just noticed one of the genes that you mentioned, RAD51B, for example; both of those papers highlight that.

That's right, of course.

Is this pipeline available online to the public?

I'm sorry?

Is this pipeline available online to the public?

This is something that's under development right now. I expect that it will be made available.

Thank you. In the case of this RAD51, where you see the later exons being highly expressed, are they part of a fusion with some HPV gene?

The short answer is I'm not sure. We haven't analyzed that.

Let's thank all the speakers one more time, please. So this concludes our session, and now Matt Meyerson and Marco Marra will make some closing remarks.