Hello everyone, and welcome to today's bytesize talk. I'm very happy to welcome again Alyssa Briggs from the University of Texas at Dallas. Today she's talking about a pipeline called viralintegration. Off to you.
Thank you for the introduction, and thank you to everyone joining us at this different time, or watching online later. I'll go ahead and get started talking about the viralintegration pipeline.
All right, a quick overview before we get started. I'm going to go over some background on the concept of viral integration and any other concepts that are important for understanding the pipeline itself, and then give a pipeline overview: how it works, what inputs you as a user need to give the pipeline, what you're going to get out, and so on. In addition, I'll talk about some current work that I'm doing with chimeric reads. Chimeric read alignment is the method that this pipeline uses. It's applicable to viral integration, but it also has a lot of other interesting potential uses, so I'll talk about some that I'm working on, and then about future directions and enhancements.
All right, so first things first: why viral integration? Viral infection can allow the viral genetic material to be integrated into, here we say the human genome, but it can be any host genome. That of course can lead to deleterious mutations, especially if you're getting these insertions into protein-coding regions. This is pretty common with retroviruses, which of course do retrotransposition into the host genome. However, it's not strictly limited to them: there are some studies showing that other classes of virus can integrate into the host genome as well. Okay, so 15.4% of cancers are attributed to some type of infection, with 9.9% linked specifically to viruses. A well-known example of this is HPV 16 and 18 leading to cervical cancer.
This is why it's important: reliable and reproducible viral integration pipelines can aid in confident identification of either known or novel viral associations with cancer and other diseases. That creates opportunities for us to find methods of prevention or possible treatment for such diseases. Again with the example of HPV, the vaccine is recommended not just to protect you from the infection itself, but from the cancer that can follow.
All right, so next up we have chimeric reads. Again, this pipeline uses a chimeric read alignment method to find those potential viral integration sites. So what is a chimeric read? A chimeric read, or split read, is one where multiple subsections of the read align to different positions in the genome. These can even be in different reference genomes; in our case it's going to be a host genome and a viral genome. The most common way you've probably heard chimeric reads talked about is as errors in either amplification or sequencing. A lot of times these just get tossed out because people think they're errors, and a lot of times they are, but there are also times when chimeric reads can be very helpful to us. They have been used to analyze interactomes: for example, if you have RNA-RNA interactions, you can do RNA-seq on those RNAs while they are in complex and analyze how they're interacting with each other. They're also useful for identifying structural variants. These uses are closer to how viral integration leverages chimeric read alignment, which is to identify those reads that contain a portion aligning to the host genome and a portion aligning to the viral genome. So we have this diagram down here. Imagine the green section is the host genome and the orange section is the viral genome.
And then you have these sequencing reads that overlap the junction between the two. If you're mapping these reads, and in this case they're paired-end reads, you're going to have part of a read aligning to the host and part of it aligning to the virus reference. That's how you get and leverage your chimeric reads.
Okay, so that's the background information that you need to understand the pipeline. What about the history of the pipeline itself? This pipeline came from an existing WDL pipeline created by the Broad Institute. It was part of the Trinity Cancer Transcriptome Analysis Toolkit (CTAT), and the tool specifically was called VirusIntegrationFinder, so for short we'll call it CTAT-VIF. That pipeline is capable of capturing evidence of virus-matching reads. That's consistent with viral infection; it doesn't necessarily mean the virus is integrating into the genome, it's just some evidence that viral reads are there. It's also able to identify and quantify evidence for viral insertion sites, which is the integration into the genome. The CTAT-VIF pipeline was only able to look at the human genome, so that was one drawback of it, but it's still a super great pipeline. It also produces interactive visualizations for evidence of virus-mapped reads and virus insertion sites, so whether you're looking for evidence of viral infection or evidence of viral integration into a genome, it provides some nice visualization for both. And lastly, it includes a nice human virus database: they already compiled a lot of different human viruses into a FASTA that you can align against and use as a reference. This was ported over to our pipeline as well, so if you are using human as your host, you already have a virus database ready for you.
Okay, so when did this conversion start, and what spurred it?
There was an interest in the chimeric read alignment approach, as well as in how it could be leveraged to look at viral integration or potentially other genetic events. So I started this conversion in April 2022. This was my very first step into bioinformatics; it was the first time I ever really looked at code or did anything related to it. I did this under the mentorship of Edmund Miller here at UTD in our functional genomics lab. It was a great experience, both having him help guide me through the process and using this existing pipeline to learn the syntax of Nextflow, how nf-core pipelines are put together, and so on. It was quite a long process, just because this was the very first time I was learning any of this, but the first release PR was opened in January 2023, and the pipeline was subsequently released in March, about a year after the conversion started. It's equivalent to CTAT-VIF version 1.5, which was released in December 2022, so all the scripts and everything are up to date. That is still the current version, so everything is currently maintained. This one-to-one mapping between the pipelines was done so that we could get to Nextflow and nf-core, where things are a little more flexible and reproducible. Now it's time to work on enhancements, make the pipeline our own, and nf-core-ify it a bit.
All right, so this is the metro map precursor; there is no fancy metro map yet. I'm working on organizing the pipeline into subworkflows, and I'd like to do that map when it's all done. But from this you can get the general idea of how it's sectioned off right now. These boxes aren't subworkflows themselves, they're just general groupings of the processes. Here we have preprocessing and host alignment. This is where your reads are aligned to the host genome.
The reads that don't align to the host genome get passed forward here and go through viral read alignment and reporting. In these steps you're also going to find your first candidates for insertion sites and your chimeric reads. Then in this grouping down here, chimeric read identification and reporting, you'll be validating those chimeric reads and then reporting them as viral integration events. That's just a very brief overview of the pipeline.
So what are the inputs? As a user, what do you need to provide? As with most nf-core pipelines, you're going to need a sample sheet in CSV format. The pipeline takes NGS reads, which can be either DNA-seq or RNA-seq. Currently the pipeline defaults to trimming Illumina adapters, but this can be adjusted in the modules config file for any other NGS sequencer as long as you have the adapter sequence. It supports both paired-end and single-end reads: currently paired-end is supported on the dev and master branches, with stable releases, while single-end is supported on its own branch. That's in progress, and I'm working to support it in a release. Okay, so those are the reads. You also need to provide the reference for the host genome, which includes a FASTA file and a GTF file. And you need a database of viral sequences of interest, which is a FASTA file. Again, if you're looking at humans, there's already one of these provided with the pipeline from CTAT-VIF, but if you're looking at a different host organism, you will need to provide this FASTA.
Okay, so these are the alignment steps. This is where a lot of the meat of the pipeline takes place, a lot of the important steps. It can be really confusing because there are three different STAR align steps, so I'm just going to go through them. STAR align host, as I mentioned, is where you're aligning your reads to just the host genome.
Any reads that align to the host genome, and any chimeric reads that we find just between the host genome and itself, we push aside and don't look at any further. You also output a FASTQ of unaligned reads, and that's what gets pushed forward into the rest of the pipeline. Those reads go through some more processing and eventually hit STAR align plus. Before you hit STAR align plus, there's a step that combines your host and viral FASTAs into one FASTA reference, and the STAR align plus step aligns your previously unmapped reads to this combined reference. Of course you get a BAM file with those alignments as usual, and you also get a chimeric junction file. This is an important feature of STAR, especially for this chimeric read approach: the file contains the instances where you have reads mapping to both the host and the viral reference. Finally, later down the pipeline, you hit a process called STAR align validate. This essentially takes your chimeric reads and the places where you think you might be seeing viral integration sites, runs alignment again with various inputs, and confirms that those are potential viral integration sites.
Okay, so this is the chimeric junction file that comes out of STAR align plus. The columns aren't quite aligned, but we can follow it with the color coding: we have the chromosome of donor A, and then the chromosome of acceptor B in green here. If we look at these first few entries, we see that the donor chromosome is HPV16 and the acceptor is chromosome 18. So this is a good example of where you're seeing a chimeric read: part of it aligns to the virus, part of it aligns to the human reference. So where does it go from there?
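As a rough illustration of how a chimeric junction file like the one just described can be read, here is a minimal Python sketch. It is not the pipeline's own code. It assumes that the first six tab-separated columns follow the layout given in the STAR manual (chr_donorA, brkpt_donorA, strand_donorA, chr_acceptorB, brkpt_acceptorB, strand_acceptorB) and that the set of viral contig names is known up front. The function names and the example rows are made up for illustration.

```python
from collections import Counter

# First six columns of STAR's Chimeric.out.junction (layout assumed from the STAR manual).
COLUMNS = [
    "chr_donorA", "brkpt_donorA", "strand_donorA",
    "chr_acceptorB", "brkpt_acceptorB", "strand_acceptorB",
]


def host_virus_junctions(lines, viral_contigs):
    """Yield (host_chrom, host_pos, virus, virus_pos) for host-virus chimeras.

    `viral_contigs` is a hypothetical, user-supplied set of contig names
    taken from the viral FASTA.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines, comment lines, and the optional header row.
        if not line or line.startswith("#") or line.startswith("chr_donorA"):
            continue
        fields = line.split("\t")
        if len(fields) < len(COLUMNS):
            continue  # malformed or truncated row
        rec = dict(zip(COLUMNS, fields))
        donor, acceptor = rec["chr_donorA"], rec["chr_acceptorB"]
        donor_is_viral = donor in viral_contigs
        acceptor_is_viral = acceptor in viral_contigs
        if donor_is_viral == acceptor_is_viral:
            continue  # host-host or virus-virus: not an integration candidate
        if donor_is_viral:
            yield (acceptor, int(rec["brkpt_acceptorB"]),
                   donor, int(rec["brkpt_donorA"]))
        else:
            yield (donor, int(rec["brkpt_donorA"]),
                   acceptor, int(rec["brkpt_acceptorB"]))


def count_by_host_chromosome(lines, viral_contigs):
    """Count host-virus junction records per host chromosome."""
    return Counter(j[0] for j in host_virus_junctions(lines, viral_contigs))


# Made-up rows, truncated to the first six columns, for illustration only.
EXAMPLE_ROWS = [
    "chr_donorA\tbrkpt_donorA\tstrand_donorA\tchr_acceptorB\tbrkpt_acceptorB\tstrand_acceptorB",
    "HPV16\t3000\t+\tchr18\t500000\t+",
    "HPV16\t3010\t+\tchr18\t500020\t+",
    "chr11\t120000\t-\tHPV16\t880\t-",
    "chr1\t100\t+\tchr2\t200\t-",
]

if __name__ == "__main__":
    print(count_by_host_chromosome(EXAMPLE_ROWS, {"HPV16"}))
```

Treating donor and acceptor symmetrically matters here, because the virus can appear on either side of the junction. Counts like these are only raw candidates; in the pipeline they still go through the later validation steps.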
This is the same diagram, just zoomed in on the later sections of the pipeline. Imagine we start here with the STAR align plus outputs, which include your chimeric junction file and your BAM file. These go into virus report, which is a reporting step in the middle of the pipeline. There also should be an arrow here that goes up into insertion site candidates. From there you're really analyzing these chimeric reads for their potential to be viral insertion sites. So you look at their potential, then they come through this extract chimeric genomic targets step, and then you go into the validation steps. Once you've validated everything, there's a little more processing that goes on here, and finally you get to summary report, which outputs some really nice visualizations that we'll look at now.
This is the primary data visualization, and it comes out of both virus report and summary report. In summary report it's a little more refined thanks to those extra processing steps. What you're seeing here is a bar graph of the virus-mapped reads. This teal color is the reads that are just mapped to some kind of virus. They're not necessarily inserted into the host genome, but they're somewhere in your reads. This pink color is your chimeric reads, which are what you're looking at for evidence of viral insertion into the host genome. There's a difference between the top and the bottom: the top was run on the full human reference, while the bottom is our optimized test data set, where you're running on just chromosome 18 for efficiency. If you get chimeric reads, you're also going to see some dots showing up on this genome-wide abundance plot. What this shows is how many reads support a chimeric event on a given chromosome.
On chromosome 18 specifically, you can see how many reads support a chimeric event happening there. Down here it's about 128; for some reason it's about 150 on the full genome run. You can also see that in the full genome run you're getting some chimeric reads on chromosomes 11 and 6 as well.
The nicest output this pipeline gives is this igv.js view. It's really just IGV; it's just that rather than going to the online web application, the pipeline outputs its own self-contained file. It shows you the alignments and how they're split between the two references. You have chromosome 11 over here and HPV16 over here, and you see how your paired reads map between the two. In a lot of instances you'll have one mate of the read pair mapping to one reference and the other mate mapping to the other reference. But occasionally you'll get these nice reads that align directly across the junction, so within one single read, on one side of the pair, you see that it's aligned to both the host and the virus reference. You can also zoom in, which is a nice feature, to get base-pair resolution. That way you can look more closely at the sequence and see where mismatches occur; for instance, down here there's a T mismatch or mutation. I think this helps you have more confidence in the alignments, because you see exactly what's going on at the nucleotide level. It also allows you to see how far these chimeric arms extend, that is, how much of each reference the read covers. And there is some debate on how long your chimeric arm should be for this to be a confident and significant alignment.
Okay, so that's a short overview of the pipeline. What else am I using it for at the moment? The chimeric read approach can be applied to lots of different topics.
One of particular interest to me is novel transposable element identification. I've been looking at how to apply this viral integration pipeline to also look for transposable element insertions within the genome. It's the same idea: you have some genomic element inserted somewhere in the genome where it technically shouldn't be. It's a little more challenging with transposable elements, though. My current ongoing effort is to optimize the parameters and deal with the bias towards these repetitive genomic elements, as well as the fact that they are endogenous to the genome, rather than being a viral element that is completely separate from the host genome. The first step was just to replace the viral FASTA with the TE FASTA. That was easy enough. Then some trial and error led to the conclusion that we should skip the STAR align host step. Because transposable elements are endogenous, if you align these reads to the host genome, a lot of them, actually all of them in my case, get filtered out in that STAR align host step, and we had nothing to work with afterwards. If you skip straight to STAR align plus, where you're looking at the human reference and also putting an emphasis on the transposable elements, you start seeing more of your results. This still needs to be optimized, but I thought it was interesting to see how else this pipeline can be used.
So, future directions. As I mentioned, I have been working on the subworkflows. Those are all laid out and pretty much ready to merge; I just need to make sure they're all still functioning properly, since I wrote them a little while ago. I also need to nf-core-ify some of the dense local modules. Because of how this pipeline was set up, it ended up that there are some local modules that have about 10 Python scripts running in them.
Which isn't super modular, and not super easy to maintain once there start being updates to these scripts and containers. So it would be nice to separate these out into smaller local modules, and then perhaps subworkflows from there. Additionally, there's always the battle of optimizing your parameters, tests, and test data sets. As I mentioned, the parameter for chimeric arm length is one of particular interest, basically to see when your alignment may be there just by chance or mismatch and when you can start to have pretty good confidence in it.
Some enhancements I'd like to add: I've been talking about this one forever, but adding MetaPhlAn 3 would be helpful. Basically, you can look at the microbial profile of your reads and see what viruses show up in MetaPhlAn 3 from those reads. This would help confirm the viral alignments that you get later on in the pipeline. If you look at your MetaPhlAn 3 output and see HPV 16 and HPV 18 showing up in these reads somewhere, and then you see a viral integration event later that involves HPV 16 or 18, that can increase your confidence in those integration alignments. I also need to add support for aligning long reads, which will include adding another alignment method. That would be super helpful, especially for aligning chimeric reads. And I'll continue developing the identification of chimeric reads involving TEs and their respective profiling.
And then I'd love to do more collaboration on this pipeline. Although it's been self-imposed, most of this has been a solo effort, or a duo effort when I was working with Edmund on it. I would love to have more of the community involved, hear anyone's suggestions and ideas for it, and really just have more of a collaborative effort. So if anyone's interested, or you have ideas, or you think something should be done another way, feel free to drop an issue on GitHub or a message in our Slack channel.
I'd be happy to hear about it and work with you on the code. Okay, so these are just the references for any of the figures and statistics that popped up throughout. And that's all. So again, if you would like to jump on the repository and submit an issue, or chat with us in the viralintegration channel on Slack, I'd be happy to talk to any of y'all. Thank you for coming, thank you for watching, and if anyone has questions, I'll be happy to take them now or later in our Slack channel.
Thank you. There is now time for questions, so if anyone has one, please ask now. But maybe I can ask one. Why did you choose STAR as your aligner?
Yeah, so there was some question about STAR, since it's RNA-seq optimized, right, not necessarily DNA-seq optimized. But STAR has that really great feature for identifying chimeric reads and split reads, along with outputting that chimeric junction file. As to the entire why, it is just how it was in CTAT-VIF, since this was a one-to-one mapping, but I think the argument for it is how well it maps those chimeric reads.
I think there's actually quite some overlap with Hi-C pipelines, because they also look for chimeric reads. I don't think they use STAR; at least the analysis that I do is with BWA and then pairtools. So maybe you could have a look into that if you're interested, because it's the same kind of problem, just more in a genomic context, of course, not transcriptomic.
Yeah, I'll have a look at that then.
Cool. Are there any more questions? Anyone can unmute themselves now if they want to. If not, then I would like to thank you for this really good talk. I would also like to thank the audience that's here, and the Chan Zuckerberg Initiative for funding the bytesize talks. So thank you very much, everyone.
Thank you.