Hi everyone, thanks for joining us. As usual, I'd like to begin by thanking our funders, the Chan Zuckerberg Initiative, for supporting outreach events by nf-core. Just a minor detail before I start today's session: the talk will be recorded, and it is being recorded at the moment. The video will be uploaded to our YouTube playlist and I will be sharing the link on our website and on Slack, so don't worry if you've missed it, you can catch up later. So I'm delighted to tell you today that we're joined by Reagan Hayward, who is based at the Helmholtz Center for RNA-based Infection Research in Germany, and he will be presenting the nf-core dual RNA-seq pipeline. It's a pipeline that is used to interrogate host-pathogen interactions through simultaneous RNA-seq. There will be time for questions at the end of Reagan's talk, and you can either use the chat function at any time today or unmute yourselves at the end of the talk and ask them directly. So thanks for joining us today, Reagan. I'd like to hand over to you now.

Thank you for the introduction. Right, so the structure of my talk is going to be pretty similar to a lot of the other bytesize talks, as in talking about a little bit of background first of all, some mention of the pipeline, and then some future directions as well.

Something that's important to start with: what is dual RNA-seq? It's RNA sequencing that simultaneously captures, in this instance, a bacterial pathogen infecting a host cell, and through bioinformatic means we're able to assign the bacterial reads to the bacterial transcriptome and the host reads to the host transcriptome. There are some challenges with read assignment in dual RNA-seq datasets, and I'll run through a few of these on the left-hand side. So for example, here we have a read which is being assigned to gene A. We're quite confident that we can happily assign that read to gene A here.
When the read overhangs the gene slightly, we're still pretty confident that we can say that this read belongs to gene A. When we have multiple annotations overlapping and the read occurs within this overlap, this makes it a little more challenging. Perhaps we want to just say that this read is assigned to gene A, or is it ambiguous, or do we want to take a proportion of the read and assign it to gene A and gene B? And when a read multi-maps to multiple genes, what do we want to do in this instance? Do we want to count it at all? Do we want to do a proportion, or do we want to say it's gene A and gene B? So these multi-mapping reads are a bit of a challenge, especially when we concatenate the genomes for dual RNA-seq studies, so we've got inter- and intra-species challenges. And this illustration I show on the left-hand side applies to both host and pathogen reads. Depending on the infection ratio, generally we have a much smaller proportion of bacterial reads in the sample, and so it becomes really important to try and assign as many bacterial reads as possible, and as accurately as possible.

So I'll spend some time on the next couple of slides talking about the bacterial transcriptome architecture, which I think a lot of people probably aren't aware of. A lot of people are probably more aware of the host side, the eukaryotic side, and how splicing occurs. With the bacterial transcriptome architecture, bacterial genes are grouped into operons. So on the right, we have a monocistronic operon with a single gene and a polycistronic operon with multiple genes inside. The operons are generally flanked by five-prime and three-prime untranslated regions. And within an operon, genes are co-transcribed into one mRNA transcript. In Illumina-based sequencing, the transcripts are fragmented into reads and the reads are assigned to genomic features.
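To make the read-assignment choices above concrete, here is a minimal Python sketch. It is not what the pipeline or any particular counting tool does internally: the function name `assign_read`, the policy names and the gene coordinates are all made up for illustration, and genes are reduced to simple intervals on a single strand.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def assign_read(read: Interval, genes: Dict[str, Interval],
                policy: str = "discard") -> Dict[str, float]:
    """Return the weight this read contributes to each gene.

    policy (hypothetical names):
      "discard"    : ambiguous reads are not counted at all
      "fractional" : ambiguous reads are split equally between genes
      "all"        : ambiguous reads count fully for every gene they hit
    """
    hits = [name for name, span in genes.items() if overlap(read, span) > 0]
    if not hits:
        return {}                      # read falls outside every annotation
    if len(hits) == 1:
        return {hits[0]: 1.0}          # unambiguous, even if it overhangs
    if policy == "discard":
        return {}
    if policy == "fractional":
        share = 1.0 / len(hits)
        return {name: share for name in hits}
    if policy == "all":
        return {name: 1.0 for name in hits}
    raise ValueError(f"unknown policy: {policy}")


# Two overlapping annotations, as in the illustration (coordinates invented).
genes = {"geneA": (100, 200), "geneB": (180, 300)}
print(assign_read((90, 150), genes))                  # overhangs gene A only
print(assign_read((185, 195), genes, "fractional"))   # sits in the overlap
```

The same three options apply to a read that multi-maps to several loci; the only difference is that the list of hits would come from the aligner rather than from overlapping coordinates.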
So in this example here, the red reads aren't being assigned, but the blue or cyan coloured reads are being assigned to a particular gene. This brings about a few challenges. For instance, many of the bacterial annotations aren't actually complete. If you look outside the model organisms, such as E. coli, Salmonella and Bacillus, a lot of the annotations don't include any of the UTR regions or a lot of the small RNAs, or even complete genomes, so a lot of the bacterial species would be in just contigs or scaffolds. In addition, a lot of the bacterial genomes contain a number of highly repetitive sequences as well, which can be difficult to assign reads to.

So going into that a little bit further, we've worked out a uniqueness score per gene in each bacterium. It's a k-mer-based approach, and we're looking at each gene. We assign a number of k-mers to each gene, and depending on the uniqueness of those k-mers, that is, whether they can be seen in other genes, we can assign a uniqueness score per gene to each of these dots. If all the k-mers are unique, that means the gene gets a uniqueness score of one. If all the k-mers appear in another gene, that means that gene would be a duplicate. You can get varying levels of uniqueness per gene. This is indicated by the colour, red being a duplicate and grey not. We've set a cut-off at about 50%, saying that anything below is considered to be repetitive. So for Chlamydia, which is a Gram-negative obligate intracellular bacterium, most of the genes are quite unique. If we expand this to other bacteria such as Mycobacterium leprae, Streptococcus pneumoniae, Salmonella Typhimurium and Orientia tsutsugamushi, we can see quite a difference. If we look at the contrasting Orientia, which is a Gram-negative obligate intracellular bacterium causing scrub typhus, if you're interested, it contains a lot of repetitive elements, as you can see visually here.
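The k-mer-based uniqueness score described above could be sketched roughly as follows. The talk doesn't give the exact definition, so this is only one plausible reading: the score is taken to be the fraction of a gene's distinct k-mers that occur in no other gene, the k-mer size of 5 is chosen only so the toy sequences stay short, and reverse complements are ignored. The function names and the sequences are invented.

```python
from collections import Counter
from typing import Dict, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def uniqueness_scores(genes: Dict[str, str], k: int = 5) -> Dict[str, float]:
    """Fraction of each gene's k-mers that are found in no other gene."""
    gene_kmers = {name: kmers(seq, k) for name, seq in genes.items()}
    # For every k-mer, count how many different genes contain it.
    genes_with_kmer = Counter(km for ks in gene_kmers.values() for km in ks)
    scores = {}
    for name, ks in gene_kmers.items():
        if not ks:                      # gene shorter than k: nothing to score
            scores[name] = 0.0
            continue
        unique = sum(1 for km in ks if genes_with_kmer[km] == 1)
        scores[name] = unique / len(ks)
    return scores


def classify(score: float, cutoff: float = 0.5) -> str:
    """Label a gene using the 50% cut-off mentioned in the talk."""
    if score == 0.0:
        return "duplicate"              # every k-mer is seen in another gene
    return "repetitive" if score < cutoff else "unique"


# Toy sequences: gene_a and gene_b are identical copies, gene_c is distinct.
toy = {"gene_a": "ACGTACGGTT", "gene_b": "ACGTACGGTT", "gene_c": "TTGGCCAATG"}
scores = uniqueness_scores(toy)
for name, score in scores.items():
    print(name, score, classify(score))
```

Counting the genes labelled "duplicate" or "repetitive" per genome gives the kind of per-species summary that the table form of this figure reports.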
Put into table form, the duplicates are the ones in red, and repetitive is anything below a cut-off of 50% uniqueness. You can see Chlamydia has a total of eight genes that would be challenging to assign reads to, and Orientia, for example, has over 1,600, which I think is about 60 or 62% of the genome. So that's something to take into consideration. Yeah, and host genomes also contain a lot of repetitive elements as well, which I think a lot of people are probably quite aware of. For example, the host genome contains about 45% repetitive elements, the human genome between about 50 and 70%, and in some dual RNA-seq studies, if you're looking into different plants, for example, the maize genome has over 80% transposable elements.

So in most dual RNA-seq experiments, typically, we use genome-based approaches such as STAR and maybe featureCounts or HTSeq, something like that. In the pipeline, we're introducing a transcript-based approach, and we're using Salmon for this. Salmon has two modes. It's got an alignment-based mode, which uses an existing tool, let's say STAR, to align the reads, and it'll use that BAM file to quantify. The second mode is selective alignment, which does a pseudo-alignment step and the quantification, so it's contained within Salmon itself. And one of the advantages of using Salmon is that it uses an expectation-maximization algorithm. It's not just Salmon, I should say; other software such as kallisto, eXpress and RSEM use this algorithm as well. It's going to assist in assigning some of these multi-mapping reads, and also reads to some of these repetitive sequences, through an iterative process. I'll give you a really basic example. The first step would be assigning all of the uniquely mapped reads, so ones that have a really high confidence. And then it would go through an iterative process assigning the remaining reads.
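The iterative idea can be sketched in a few lines of Python. This is a toy expectation-maximization loop, not Salmon's implementation: it ignores transcript lengths, fragment-level probabilities and bias models, and the function name `em_assign`, the gene names and the read counts are invented for illustration.

```python
from typing import Dict, List


def em_assign(reads: List[List[str]], n_iter: int = 50) -> Dict[str, float]:
    """Estimate per-gene read counts when some reads fit several genes.

    Each read is given as the list of genes it is compatible with.
    """
    genes = sorted({g for compat in reads for g in compat})
    if not genes:
        return {}
    # Start with no preference: every gene is equally abundant.
    theta = {g: 1.0 / len(genes) for g in genes}
    counts = {g: 0.0 for g in genes}
    for _ in range(n_iter):
        counts = {g: 0.0 for g in genes}
        # E-step: split each read across its compatible genes in
        # proportion to the current abundance estimates.
        for compat in reads:
            total = sum(theta[g] for g in compat)
            for g in compat:
                counts[g] += theta[g] / total
        # M-step: re-estimate abundances from the fractional counts.
        n = sum(counts.values())
        theta = {g: counts[g] / n for g in genes}
    return counts


# 100 reads that map uniquely to gene B, plus one read that fits A or B.
reads = [["geneB"]] * 100 + [["geneA", "geneB"]]
counts = em_assign(reads)
print(counts)
```

In this toy run the uniquely mapped reads dominate the abundance estimate for gene B, so with each iteration the ambiguous read is pushed further towards gene B.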
So for example, after the first step, maybe gene A has zero reads and gene B has 100 reads. If there's a read that could be assigned to gene A or gene B, then there'd be a much higher probability of this read being assigned to gene B. And so using Salmon, we've found through a lot of benchmarking, is advantageous for assigning reads from dual RNA-seq data.

So I'll talk about the pipeline a little bit now. For the inputs, we have Illumina-based sequencing reads and a host and pathogen genome and annotation. We've got FastQC for some quality control steps, and also adapter removal and read trimming through BBDuk and Cutadapt. Then there are some preprocessing steps of merging the host and pathogen genomes and references to create this kind of chimeric reference that we use. For the parallel read mapping and quantification steps, we've got a more traditional genome-based approach, which is using STAR and HTSeq in this instance. And then we have our two transcriptome-based approaches using Salmon, either with STAR alignments or with its own selective alignment. For the traditional genome-based approach, you can just supply a host genome and annotation and a pathogen genome and annotation. But for the transcriptome-based approaches, you will need to supply a host transcriptome, and the pathogen transcriptome is created automatically in the pipeline.

The pipeline outputs separate host and pathogen features in various reports and plots, including correlation plots for both the host and pathogen samples per condition. We also get a breakdown showing the number of uniquely mapped host reads, uniquely mapped pathogen reads, multi-mapped host and pathogen reads, cross-mapped reads, that is, reads mapped between species, unmapped reads and trimmed reads. You also get the biotype breakdown per sample. And depending on which method you use, or if you use a combination, you get this output for each method.

So, the status of the pipeline? Well, there are a number of performance improvements I need to include.
I need to update to the latest template as well, which I'll be doing shortly. And I'd like to include some additional outputs, both graphical outputs and some data outputs as well. An example of that is WIG files that are separated by host and pathogen. I'd also like to migrate to DSL2. And probably one of the questions I get asked the most when I talk about this pipeline is: is there support for all dual RNA-seq data sets? The answer to that at the moment is no, it's just bacterial, because it's based on this transcriptome architecture I spoke about earlier. We are considering support for viral host-pathogen data sets, so if anyone's interested in this particular bit, I'd be curious to talk to them about some of the features that they'd like to see, and I'm going to try and include those in my next update.

I'd like to thank everyone: Harry, the nf-core team and community, and the Salmon development team as well, for their help. Yeah, thank you.