Hello everyone and welcome to this week's bytesize talk. I'm very happy to have with us today Jonathan from Seqera, and he is going to talk about the nf-core pipeline differentialabundance. Over to you.

Thank you for the introduction. I'm going to spend a few minutes talking about differentialabundance, which is maintained by myself and by Oscar. Before I kick off, I just wanted to clarify some terms. By "abundance" we're talking about the magnitude of matrix values; I'm aware that abundance is a bit overloaded in some communities, but here it's just a term we use to be more generic than "expression". When I talk about "features" I mean the individual variables and measurements of the matrix, which in the expression space means genes or transcripts. By "observations" we mean the individual samples or experimental units, and by "covariates" we mean variables that possibly predict the outcome under study but are not of primary interest; batch, for example, is one that comes up quite often in expression data. And just to reinforce that: if you're used to dealing with expression data in the R world, you might use a structure with a central "assays" concept where the matrices are stored, "features" where the metadata associated with genes and transcripts are stored, and "samples" where the observations are stored. In the nf-core world, that samples side is basically a sample sheet holding the observation metadata.
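The assays/features/samples structure described here (familiar from Bioconductor's SummarizedExperiment) can be sketched with toy data. Everything below is illustrative, not actual pipeline input:

```python
# Minimal sketch of the assays / features / samples structure described above,
# using made-up toy data (all names here are illustrative).

# assays: the abundance matrix, features (rows) x observations (columns)
feature_ids = ["gene_a", "gene_b", "gene_c"]
sample_ids = ["s1", "s2", "s3", "s4"]
assay = [
    [10, 0, 5, 7],
    [0, 0, 0, 0],   # an all-zero row, the kind default filtering removes
    [3, 2, 8, 1],
]

# features: metadata associated with genes/transcripts, keyed by feature id
features = {
    "gene_a": {"gene_name": "A"},
    "gene_b": {"gene_name": "B"},
    "gene_c": {"gene_name": "C"},
}

# samples: the sample sheet, one record per observation, keyed by sample id
samples = {
    "s1": {"condition": "control", "batch": "b1"},
    "s2": {"condition": "control", "batch": "b2"},
    "s3": {"condition": "treatment", "batch": "b1"},
    "s4": {"condition": "treatment", "batch": "b2"},
}

# the three pieces must stay consistent with one another, which is the kind
# of check the workflow's validation step performs on its inputs
assert set(features) == set(feature_ids)
assert set(samples) == set(sample_ids)
assert all(len(row) == len(sample_ids) for row in assay)
```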
Another important concept for differential analysis is the concept of contrasts. In the top part of this slide we have a sample sheet, and I've highlighted the condition column. To define a contrast, we might use that column and say that we want to compare the samples that have "control" in that column against the samples that have "treatment". Extending this very slightly, we have the concept of blocking in the contrast variables, which is used to incorporate batch-like covariates into the differential expression modelling; we'll get to that a bit more later.

The objective of this workflow was really to act as a unified point of access for differential analysis of matrices with diverse data types, of which RNA-seq is, I think, the most widely used, but we also have expression arrays and there's a proteomics pathway through there as well. Basically, we want to share common components in these sorts of analyses across data types, so matrix filtering and graphical representations like volcano plots can all be shared, and then we have some high-quality reporting at the end to make all of this pretty. So yes, we have a similar approach across data types: a process that gets data, some exploratory analysis, filtering of the matrices in some way (often removing all-zero rows), comparing groups, and producing a report at the end, and that's common across all these data types.
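To make the contrast idea concrete, here is a sketch of a contrasts file and the sample selection it implies. The id/variable/reference/target/blocking columns follow the pipeline's documented layout, but the data and the selection code are purely illustrative:

```python
import csv
import io

# A contrasts file in the spirit described: each row names a variable in the
# sample sheet plus the reference and target levels to compare, with an
# optional blocking (batch-like) variable for the modelling step.
contrasts_csv = """id,variable,reference,target,blocking
condition_control_treatment,condition,control,treatment,
condition_control_treatment_blockbatch,condition,control,treatment,batch
"""

# A toy sample sheet with a condition column and a batch covariate.
sample_sheet = [
    {"sample": "s1", "condition": "control", "batch": "b1"},
    {"sample": "s2", "condition": "control", "batch": "b2"},
    {"sample": "s3", "condition": "treatment", "batch": "b1"},
    {"sample": "s4", "condition": "treatment", "batch": "b2"},
]

# For each contrast, pick out the reference and target sample groups by
# matching the named variable's values in the sample sheet.
selected = {}
for row in csv.DictReader(io.StringIO(contrasts_csv)):
    ref = [s["sample"] for s in sample_sheet if s[row["variable"]] == row["reference"]]
    tgt = [s["sample"] for s in sample_sheet if s[row["variable"]] == row["target"]]
    selected[row["id"]] = (ref, tgt)
```

The blocking column does not change which samples are selected; it names the covariate the differential module should model alongside the contrast variable.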
The current status of the workflow looks something like this. If I could get you to concentrate on this middle row: we have four key inputs to the workflow, namely the feature annotations, the abundance values, the observation annotations (the sample sheet I just mentioned) and the contrast definitions. All four get combined: we validate them to make sure they are consistent with each other, we filter the matrices to do the sort of zero filtering I mentioned, then there are currently two different differential analysis methods used depending on the input data type, and we proceed to optional gene set enrichment, exploratory analysis and the final reporting steps.

I also want to highlight the top left corner of the tube map here. These are the different input methods that can be used to generate those primary inputs. For example, we have an input method using the GEOquery module, which uses a GEO identifier to pull a matrix from GEO, and we have a pathway by which you can specify Affymetrix array intensities, which again get converted into a matrix that can be used in the downstream analysis. These are just different entry points into the workflow to generate those common inputs.

The generic usage looks much like a lot of other nf-core workflows: we have an input sample sheet, a matrix holding the abundance values, a GTF file for the gene annotations, and then the contrast definitions I mentioned before. Obviously we have a lot of other parameters, but these are the key inputs, and with them the workflow will run in a minimal sense.

To dwell a little more on the RNA-seq case: the nf-core/rnaseq workflow outputs matrices generated using the tximport R
package (or the tximeta package), and that is useful for accounting for biases in transcript length across samples. It's very important to adjust for cases where differential isoform usage across treatment groups can lead to differences in length: if treatment group A primarily uses a very short isoform and treatment group B primarily uses a very long one, that can introduce statistical biases in the counts which we need to adjust for. The best way of doing that right now is to take the raw counts that come out of the rnaseq workflow (the salmon.merged.gene_counts.tsv output) and also pass the transcript length matrix into the differentialabundance workflow, which allows that modelling to occur. We only added that fairly recently to the rnaseq workflow, so as a second best for older rnaseq versions you can just use the length-scaled counts that come out of that workflow.

There is now a proteomics pathway through the workflow, built by Oscar. That allows you to take data produced by MaxQuant, use the proteus package to convert that table into an abundance matrix, normalise that matrix, and then pass it into all the downstream differential analysis and reporting I've just mentioned.

Before any matrix goes further through the workflow we have to do this validation. Just to emphasise that a bit more: we check that the feature annotations we provided are compatible with the rows of the matrix, that the sample annotations are compatible with the columns of the matrix, and that the variables we've used in defining our contrasts are actually present in the sample sheet, for example. All of that happens before anything else goes through the workflow.

Then the first real analysis step is to do some filtering. The default is just to
remove all rows that are all zeros, but there are slightly more advanced options available: you can specify a different threshold rather than zero, you can specify a number of samples that must pass that threshold, or you can specify a proportion of samples that must pass it. That's often quite useful in RNA-seq data: if you have a hundred samples, for example, and you know that your smallest treatment group size is 10, you can say that you want at least 10 samples to pass the threshold, because that would amount to one of your treatment groups passing. You can do things along those lines.

For differential analysis we have DESeq2, which is currently used for RNA-seq and other undefined input types, and we have limma, which is used for the Affymetrix array data and the proteomics data. Both of these modules have a consistent interface: they accept contrasts in the same manner and they model covariates in the same manner, and we might anticipate that as other data modalities get added to the workflow in the future we'll add other specialist differential modules.

Just a note on batch handling: it's bad form in this sort of analysis to do an actual batch correction prior to differential analysis; instead we model batch as a covariate as part of the differential analysis, using models that look a bit like this (along the lines of ~ batch + treatment) in DESeq2 or in limma. That said, it would be nice to have an actual batch correction in place to help with exploratory analysis, for example to remove the effects of batch when doing PCA. We don't currently have that, but we do plan on adding it at some point in the future.

Downstream of the differential analysis we have gene set enrichment analysis. We've had the GSEA tool in here for a while; that's just a wrapper around the Broad's GSEA tool, and that's not based on
thresholding of genes, it's based on ranked gene lists in the data, so that's a good way to go. But Oscar has also recently incorporated the gprofiler2 method into the pipeline. It's currently unreleased, so it's not in a release version of the workflow right now, but that one is based on thresholds: it takes gene sets and compares them against a background.

One of the reporting outputs of the workflow is an HTML report derived from an R Markdown file. Importantly, we also provide the R Markdown file itself, with all the parameters resolved, at the end of the workflow. That allows someone to take that markdown file and the results files bundled with it and customise it to do their own type of analysis, or to tweak the plots and so on, which can be quite a powerful thing to be able to do, because we can't anticipate all the things people might want to do in their reporting.

Another option we have is the shinyngs package. I built that in about 2016, many years ago. It's an R package which builds Shiny applications based on standard inputs: you give it a set of matrices, some p-value tables, and gene set analyses if you have them in the right format, and it will automatically build a Shiny data mining tool. That allows you to modify the thresholds used to select gene lists, produce plots from subsets of the genes, and so on, which is quite nice. And if you configure the workflow correctly, it will actually push an application directly to shinyapps.io, the platform-as-a-service offering from the RStudio people, so you can have the workflow produce that data mining application automatically, which can be quite nice. I should
say that shinyngs is an independent product and it's quite powerful, but I haven't wired every feature of it into the workflow yet, so not everything you get from the nf-core differentialabundance workflow connects into shinyngs yet. It is there to a large extent, though, and you can do some quite powerful data mining with it. You get plots like this: this is a volcano plot, and if I were showing it to you live you'd be able to mouse over these points to find out which genes they are, adjust the thresholds to move these boundaries around in the plot and show different sets of genes, change the colour palettes and point sizes, and do all that cool stuff as well. That's powered by Plotly, and you can export those plots from this point-and-click interface.

As for the to-do list: we do need to optimise the workflow for larger sample numbers; it currently doesn't perform very well once you get above tens of samples. We need to fix that with adaptive reporting, so that we don't try to use some of the plots we have right now for very large sample numbers, and we need to cut down the report size somehow; I think some of the HTML can be reduced by rounding some of the values. I'd like to add some sparse matrix handling. We have some more gene set methods that people have asked us to add. I want to modify the way we use GSEA, or at least have an option to do so, so that we use ranks derived from the parametric fold changes from DESeq2 and so on. I want to improve the shinyngs integration of gene sets, which is not currently wired in, and I would like to add the batch correction, as I mentioned. Suzanne Jin is also working on an alternative pathway through this workflow using log ratio analysis, and hopefully that will be available sometime later this year.

Just to give credits: I
built the initial workflow structure and some of the differential modules. Oscar has since done a lot of work adding a number of improvements, especially in the reporting stages, plus the proteomics functionality I mentioned. Azedine, a former colleague of mine at Healx, built the GEOquery functionality. And it's always necessary to thank the other nf-core community members for all their bug fixes, PR reviews, and for implementing all the cool standards. I should acknowledge my former employer, Healx, who funded me during the first phases of development of this workflow while I was an employee there. I'm currently employed by Seqera, who are a great company providing great community support, and we should credit Oscar's employer, QBiC at the University of Tübingen. Okay, that's me done, thank you very much.

Thank you so much, that's really interesting; I love the Shiny app. Everyone from the audience is now able to unmute themselves and ask questions. Maybe I'll start with the one that's already in the chat: it asks whether SVA is used for batch correction, for example.

That's one we could use, yes. To add that batch correction for use in exploratory analysis, yes, that could be SVA, it could be ComBat, any of the commodity batch correction tools.

Thank you. I also have a question: how many ways do these comparisons go? Is it just a paired comparison, with two sets, or can you have more than two sets?

It's only pairwise comparisons, but you can have as many pairwise comparisons as you want. As long as you put them in the contrasts file, you can compare in any number of different ways. People have asked for time series analysis; that's not currently possible in the workflow, with the kind of likelihood ratio tests and so on, and we haven't currently implemented anything
like that.

Okay, thank you. Are there any more questions from the audience? Victor is asking whether it supports dose responses. Could you maybe expand a bit on that, Victor?

Well, if you have a dose response, can you do statistical testing on it, for instance if you have ten different doses?

Something like a continuous covariate, effectively? Exactly, yeah. No, it's just discrete contrasts right now.

Thank you. Are there any more questions from the audience?
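Since comparisons are strictly pairwise but unlimited in number, a dose variable like the one in the question could still be handled by enumerating discrete pairwise contrasts. This hypothetical helper (not part of the pipeline) sketches that idea:

```python
from itertools import combinations

def pairwise_contrasts(variable, levels):
    """Enumerate every pairwise (reference, target) contrast for a variable."""
    return [
        {
            "id": f"{variable}_{ref}_{tgt}",
            "variable": variable,
            "reference": ref,
            "target": tgt,
        }
        for ref, tgt in combinations(levels, 2)
    ]

# e.g. a dose series treated as discrete levels, as the workflow requires:
# three levels yield three pairwise contrasts for the contrasts file
contrasts = pairwise_contrasts("dose", ["d0", "d10", "d100"])
```

Each generated row could then be written out as a line of the contrasts file, so a single variable with many levels simply produces many pairwise comparisons.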