Hello everyone, my name is Franziska Bonath, and I'm very happy that Gisela is with us today from the University of Tübingen. She is giving us an overview of what nf-core/airrflow can do.

Thanks, Franziska, for the nice introduction. I will present nf-core/airrflow, and I will start by defining what the AIRR in airrflow is. AIRR stands for adaptive immune receptor repertoire. Adaptive immune receptors are the collection of membrane proteins found on the surface of B cells, where they are called BCRs or B cell receptors, and on the surface of T cells, where they are called TCRs or T cell receptors. BCRs in their secreted form are also called antibodies, a term we are all more familiar with.

The main function of these receptors is to recognize foreign antigens inside the human body, which can come, for example, from pathogens such as viruses or bacteria, and to elicit an immune response against them. To be able to recognize so many different antigens from different pathogens, these receptors also need a variety of different sequences. It is estimated that at any time point the human body contains 10 billion to 100 billion different receptor sequences, BCRs and TCRs. AIRR sequencing is about obtaining the individual sequences of these B cell and T cell receptors. Its applications range from determining the immune state of an individual at a specific time point and studying immune-related diseases to guiding vaccine development and cancer immunotherapy. So I'm going to talk in a bit more detail about how this diversity of TCRs and BCRs is generated.
In the human genome there are different V gene segments, D gene segments, J gene segments and C gene segments that can be combined to form these TCRs and BCRs. In humans there are about 40 V gene segments, 23 D gene segments and six J gene segments. What happens in B cells and T cells is that one segment of each kind is combined to form a productive TCR or BCR. This happens at the DNA level: a process called somatic recombination alters the genome of these cells and generates the productive TCR and BCR. The combination of these gene segments does not happen like Lego blocks, by just attaching them next to each other, but rather in a cut-and-paste procedure. The cutting position is not always at exactly the same spot, and some nucleotides can also be incorporated into the junction regions, so extra variability comes from this step that is not genome encoded.

In the case of B cell receptors, there is yet another process that generates more diversity, and it happens upon antigen stimulation. That is what happened to all of us, for example, when we were first in contact with the coronavirus or the coronavirus vaccine: some B cells in the body were able to recognize the antigens in the coronavirus, were stimulated and underwent clonal expansion, generating many children cells that belong to the same B cell clone. Not all of these children cells have exactly the same BCR sequence: a process called somatic hypermutation introduces mutations in the VDJ segments, so that each of the children cells has a slightly different receptor sequence. This also allows generating B cell receptors, and therefore antibodies, with even higher affinity to the original antigen. So you've seen that there are many different processes that
contribute to the diversity of these TCR and BCR sequences, including somatic recombination, variable junction length and, in the case of BCRs, also somatic hypermutation. This means that theoretically there could be more than 10 to the power of 14 possible BCR sequences.

So how are they sequenced? Most protocols use what's called amplicon sequencing, which is the targeted amplification of this gene locus. It can be done via different protocols, including multiplex PCR, which provides primers for all the different kinds of sequences that can be there, but 5' RACE amplification protocols are also pretty common. The sequencing protocols can also incorporate what are called unique molecular identifiers (UMIs), which allow correcting, down the line, for sequencing errors and for errors introduced by the PCR amplification process. Typically MiSeq sequencers are used, because they allow longer read lengths that cover the complete VDJ region and the beginning of the C region.

So how is the bioinformatic analysis done for this kind of sequence? It typically does not happen like a traditional RNA-seq analysis, because mapping to a reference genome is challenging here, due to the high diversity of BCR and TCR sequences and because this is a highly repetitive genome region with all of these V, D and J gene segments. For this kind of analysis specific tools are used, and the reads are aligned to specific reference data for BCRs and TCRs. Luckily for us, when wanting to write a pipeline for this kind of analysis, there are already plenty of tools out there that can analyze this data. One of the better-known ones is the Immcantation framework, developed by the Kleinstein lab at Yale, an open-source tool set to analyze AIRR-seq data from beginning to end.
There's a whole community of users already using this framework, and here you can get the details in case you want more information. Thanks to developing this pipeline as part of the nf-core community, we gained visibility and quickly found quite a few collaborators. I want to mention that this is really a community-based development effort: Susanna Marquez from the Immcantation lab joined quite early on, and David Ladd from Monash University helped add some features, whereas initially Alex Peltzer, when he was still at QBiC with us, Simon Heumos and myself were developing the airrflow pipeline.

Now to the details of the pipeline. These are the main pipeline steps. The pipeline can process both bulk AIRR sequencing data and single-cell AIRR sequencing data. When starting from bulk data, there is first a step of quality control of the sequencing reads and sequence assembly. Afterwards, the reads are aligned to the references with IgBLAST; the reference data is typically taken from the IMGT consortium, which provides reference data for BCRs and TCRs. Already assembled data can also be provided at this step, and for single-cell data we typically start here. Next there is a step for clonal analysis, which identifies which sequences belong to which clones, so it assigns the individual BCR and TCR sequences to their specific clone, and in the case of B cells it can also perform lineage reconstruction of the whole B cell clone. Finally, there is a step for repertoire analysis and reporting, including QC reports via MultiQC. Those are the general pipeline steps, but if we look in a bit more detail, there are a ton more processes that are part of the pipeline, and I'm going to explain them in more detail now.
Starting with QC and sequence assembly: the pipeline supports different sequencing protocols, including multiplex PCR, for which users have to provide the V- and C-primer sequences that were used for the amplification, and 5' RACE, for which they provide the C-primer and linker sequences. Both protocols are supported with and without UMI barcodes, and the barcodes can be provided in different configurations. Starting from the raw sequencing data, a sample sheet needs to be provided that contains the sample information and the individual FASTQ files for all samples.

Depending on whether the sequencing protocol includes UMI barcodes, some processes or others will run, but they all start with quality control of the reads with FastQC, filtering the sequences by a quality threshold and masking the primer sequences. If a UMI-based protocol is used, a consensus is then built from all the sequences that have the same UMI barcode, which allows correcting for the errors I mentioned before. There is also an extra procedure that is employed whenever it is estimated that the length of the UMI barcodes will not be sufficient to cover all the diversity of the sample: all sequences are first clustered by similarity and annotated with the cluster ID, so that two different sequences with the same UMI barcode can still be distinguished. After building the consensus, all sequences that contain the same UMI barcode are collapsed, and the number of sequences with the same UMI barcode is annotated along with other metadata. Duplicate sequences with different UMI barcodes are also collapsed and their count is annotated, which can be used for filtering.
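
Editor's note: the UMI consensus and duplicate-collapsing idea described above can be illustrated with a minimal Python sketch. This is not the pipeline's implementation. It assumes reads are already trimmed to equal length, uses a simple per-position majority vote, and all function and field names (`build_umi_consensus`, `collapse_duplicates`, `umi_count`, `read_count`) are hypothetical.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Per-position majority vote over equal-length sequences (assumption)."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*seqs)
    )


def build_umi_consensus(reads):
    """Group (umi, sequence) pairs by UMI and build one consensus per UMI.

    Returns a dict mapping umi -> (consensus_sequence, number_of_reads).
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    return {umi: (consensus(seqs), len(seqs)) for umi, seqs in groups.items()}


def collapse_duplicates(umi_consensus):
    """Collapse identical consensus sequences that carry different UMIs.

    Returns a dict mapping sequence -> counts, where umi_count is the number
    of distinct UMIs and read_count the total number of supporting reads.
    """
    collapsed = {}
    for seq, n_reads in umi_consensus.values():
        entry = collapsed.setdefault(seq, {"umi_count": 0, "read_count": 0})
        entry["umi_count"] += 1
        entry["read_count"] += n_reads
    return collapsed


if __name__ == "__main__":
    # Toy data: the third read under UMI "AAAA" carries one sequencing error.
    reads = [
        ("AAAA", "ACGTAC"), ("AAAA", "ACGTAC"), ("AAAA", "ACGAAC"),
        ("CCCC", "ACGTAC"),
        ("GGGG", "TTGTAC"), ("GGGG", "TTGTAC"),
    ]
    per_umi = build_umi_consensus(reads)
    for seq, counts in collapse_duplicates(per_umi).items():
        print(seq, counts)
```

In this toy input the erroneous read is outvoted by the two correct reads sharing its UMI, and the two UMIs that yield the same consensus are merged into one entry whose counts could then be used for filtering.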
After this step comes the VDJ assignment and filtering step. Here it is also possible to start with already assembled sequences, which are provided with a sample sheet and FASTA files. Single-cell sequencing data processing also typically starts at this step, because the pipeline directly supports the output of the 10x Genomics tool Cell Ranger multi, which supports incorporating TCR and BCR sequencing into the 10x Genomics sequencing procedure. The output of that tool is the AIRR rearrangement table, which contains all the sequences, and it can be provided directly to the pipeline at this step. What this step does is align the sequences to the IMGT reference and assign which exact V, D and J segments were used; in the case of single-cell data, this gene reassignment step is optional.

After alignment to the IMGT reference, a number of quality filtering steps are performed. It is checked that the locus matches the V call, that is, the V gene segment assignment; that there are a minimum of 200 informative positions and a maximum of 10 percent N nucleotides; that the sequences are productive; and that the length of the junction region is a multiple of three nucleotides. There is also the possibility of removing chimeric reads, detecting contamination across samples and, finally, collapsing duplicate sequences if there are any. So there are plenty of quality filters as part of the pipeline.

The next step is clonal analysis. Here hierarchical clustering is used, based on the Hamming distances between sequences, and the pipeline is able to auto-detect a Hamming distance threshold that determines which sequences are part of the same clone and which are part of different clones. There is also a step for reconstruction of the clonal lineage trees. And, as a recent addition, the
pipeline uses the enchantr tool, which is developed by Susanna Marquez at Immcantation. That tool provides calls to other Immcantation tools and also nice reports for each of these steps, so I invite you to check them out here.

Finally, there is a repertoire analysis step, where an R Markdown report is provided that summarizes the repertoire analysis results for all samples. Of note, a custom R Markdown report can also be provided in case the user wants to change some things in this report, so it is possible to supply your own R Markdown file. The other output of the reporting steps is the MultiQC report, which gathers the read quality control reports for all samples. Here we see an example of the repertoire analysis report: first there is a summary of all the samples used for the analysis, then clonal abundance and clonal diversity are reported together with V gene usage, and finally all the tools used as part of the pipeline and their citations are listed, to make it really easy for users of the pipeline to cite the original tools.

As you know, all documentation for nf-core/airrflow can be found on the nf-core website, so check it out if you want to use it. There are also example results of the pipeline there, from the full tests run on AWS. What's next for this pipeline? Stay tuned for a new release that is coming really soon, we hope this week or next, and that includes more quality control and reporting as part of the enchantr tool, as I've mentioned, by Susanna Marquez, and code refactoring using subworkflows.

At this point I would like to thank all the contributors to the pipeline: Simon Heumos and myself at QBiC; Alex Peltzer, who was initially at QBiC but is now at Boehringer Ingelheim; Susanna Marquez at the Kleinstein lab at Yale; David Ladd at Monash University; and our collaborators at the University of Tübingen, Christoph Ruschil and
Markus Kowarik. If you have any questions, don't hesitate to join the airrflow channel on the nf-core Slack, and if you have questions related to the Immcantation tools, here are the contact emails to reach them directly. Thank you very much.

Everyone can now unmute themselves if there are any questions. Maybe I'll start with one. I'm curious: in what format does airrflow expect UMIs to be provided? Does it have to be in a separate file, or should it be in read one or in read two?

It supports all of the configurations you have mentioned. It depends on your library design where the UMI barcodes are located: sometimes they are part of the R1 or R2 reads, and sometimes they are part of the index reads. There are parameters in the pipeline where you can specify where the UMI barcode is located, in the R1 read, the R2 read or the index files, so everything is supported in that case.

Thank you. Are there any questions from the audience? It doesn't seem so. Then I would like to thank Gisela, of course, but also the Chan Zuckerberg Initiative for funding the bytesize talks. As usual, if you have any questions, go to Slack, to the nf-core airrflow channel, and ask them there. Thanks again.

Thank you, Franziska.
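
Editor's note: as a supplement to the clonal analysis step described in the talk, the following minimal Python sketch groups junction sequences into clones by length-normalized Hamming distance. It is an illustration under stated assumptions, not the method used by the pipeline or by the Immcantation tools: single linkage is chosen here only for simplicity (the talk says hierarchical clustering without naming a linkage), the threshold is passed in by hand rather than auto-detected, sequences of different lengths are never merged, and the names `hamming` and `cluster_clones` are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def cluster_clones(junctions, threshold):
    """Assign a clone id to each junction sequence.

    Two sequences are linked when they have the same length and their
    length-normalized Hamming distance is <= threshold. Linked sequences
    end up in the same clone (single linkage, via union-find).
    """
    parent = list(range(len(junctions)))

    def find(i):
        # Walk up to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(junctions)), 2):
        a, b = junctions[i], junctions[j]
        if len(a) == len(b) and hamming(a, b) / len(a) <= threshold:
            parent[find(i)] = find(j)

    # Relabel roots as consecutive clone ids in order of first appearance.
    clone_ids = {}
    return [
        clone_ids.setdefault(find(i), len(clone_ids))
        for i in range(len(junctions))
    ]


if __name__ == "__main__":
    junctions = ["TGTGCAAGA", "TGTGCAAGG", "TGTGCTTCC", "TGTGCA"]
    print(cluster_clones(junctions, threshold=0.2))
```

With the toy input, the first two junctions differ at one of nine positions and are placed in the same clone, while the third differs at four positions and the fourth has a different length, so each of those forms its own clone.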