 Hello, can anybody online hear me sound check? Okay, great All right, remember to everyone online put any questions you have in the chat We will get to them at the end with that. I will turn it over Hello, okay Good afternoon everyone. Thank you for joining us today for our package demo workshop So I'm Xing Ge Wang from Professor Riemann's lab at the University of Illinois at Chicago Today I'm going to co-host these workshop together with Shang Gao who is standing beside me wearing the same t-shirt And the title for today's workshop will be analyzing cellular heterogeneity across time and across biomedical conditions or interventions So here's our line of today's workshop. So we're going to introduce two our packages So the first part Shawn is going to introduce the Bayfem So Bayfem is a Bayesian inference model that integrates the chip-seq transcription factory version into the single cell RSC data And since Bayfem gonna take some some like a long time to run So Shawn is going to introduce the major functions and workflows of Bayfem using the slides and the second part I'm going to demonstrate the TranCatcher R package which was designed to analyze the time course RSC in data so included both the bulk RSC and the single cell RSC data and We hopefully together with these two kids It's gonna help you to demonstrate some fundamental biological questions regarding the cell heterogeneity across medical conditions and time Okay, so if You are using an orchestra to run the live coding for TranCatcher. You can start to launch right now And then we also provide two other optional to long to download the TranCatcher One is Docker image the other one you can just install from our GitHub and for today's document Documentary like you can see we also put it on our GitHub and the for bioconductor workshop So If you go right here You're gonna see the details and all the vagueness right here Okay. Okay. I'm gonna hand these to Shawn to introduce Bayfem so Yeah, so before BFM we don't have live coding now So we can just go to the GitHub to download the the package and look at the food material But I will major explain the major functions of BFM and yeah, you can have a sense of how to use it Okay, so good afternoon everyone Thank you for attending this workshop and it's my pleasure to share our package with everyone here. So Today I would like to introduce our package BFM So I think everyone has already know the singles RNA sequencing Technique and the motivation of doing singles RNA sequencing is trying to identify the cell heterogeneity so the next question we are asking is What drive the cellular heterogeneity at the transcriptome level? So as we all know that the transcription factors are the key regulator gene expression We can infer transcription activity in single cells We can kind of answer the first question We can also use the infer transcription activity to identify the cell heterogeneity such as clustering cells or send the Sudo time to each individual cell. So here we develop a bathing Hierarchical model BFM to infer transcription activities by integrating the gene expressions with transcription factor chipset beta and BFM can be used to infer heterogeneity of transfer factors between the cell types and Or between the different conditions I'll give a brief introduction about the BFM framework. So first of all, we have a normalized the singles RNA sequencing gene expression the constable and the BFM is a factor analysis model so we will factorize this Y matrix into two matrix W and D and During this inference, we will also integrate with another Chipsic prayer knowledge. It's in this binary matrix We will have the relationship between the target genes and the transcription factors so in this matrix if the If the gene is a potential target of the transcription factor But from the Chipsic data then this entry will be one otherwise it will be zero and from this binary matrix we will assign different prayer distribution on this W matrix. So then this Sorry, these two matrix will have biological meaning The double matrix will be the regulatory strength between transcription factor and target genes in this specific single cell dataset and The Z matrix will be the infer transcription factor activities And we can use the W and D matrix to do further downstream analysis to identify the cell hydrogenated For example doing clustering and the trajectory building So then next I will introduce the major functions in BFM So there are four major functions in the package. So the first one will be BFM prepossessed So in this function, we will use sera to normalize the single cell RNA sequencing data and filter the genes by the most variable expression and also to identify the The most variable express the transcription factors. So that is the Default setting for select the transcription factors. I will show you you can definitely add your the transcription factor You are interested in so I will show you how to do that later and the next function is BFM So this one is a main function of the package So this one will infer the transcription factor activities from the single cell RNA sick data based on the chip sick chip sick prayer prayer knowledge and this in this function we are using VB function in our stand and This function are using automatic differentiation variational inference method to to learn the variational posterior posterior distribution parameters and The third function in BFM is BFM activities and this function is used to extract the infer transcription factor activities from the outputs of the main function and the last one will be BFM weights and This function will extract the transcription factor and the gene weights from the output of this main function So now let's look at Some the the the structure of how to use BFM with the data set which is adult liver single cell RNA singling data To show how we can use this first of all we'll use a BFM preprocess to preprocess the data so you can use so this you are all data is a Count all counts matrix You can just a part feed this to the BFM preprocess and then it will Generate the normalized the matrix for you normalize the count matrix for you and Then you can pass this normalize the matrix into the BFM main functions And here are multiple augments in this main function The first one will be your normalized the matrix the second one will be the species you are your single Sonaristic data is so we only support human and the mouse in in this version and also here is a way how you add it Your interested transfer factors, so it will be a vector and you can you pass a vector into this augment and each of the element in this all vector will be your Transfer factor that you are interested in and make sure that it's a it's a gene name It's a gene symbol of the transfer factors And also here is how many cores you want to use and this one either will be Your the literate number of iterations you want to learn in this Inference and the default setting is a southern because we are using the automatic differentiation variational inference, so the key idea is they are using the Stochastic a gradient descent to find the most optimized variational posterior Parameters distribution parameters, so this one just a default setting You can first to try this default setting if you find after a southern Literations it does does not converge you can always like increase this one But you already based on our experience experience this one is good enough So the last two augment is the torrent relevant norm of the objective so this one is a default setting is point zero five double double zero five and We we think this before one is the target of the object you are learning so this one is good enough You can always play with it Okay, so after you get the result from the BFM main matrix So you can extract the infer transfer factor activities So we just use the call this function BFM activities, and then we'll give you the infer transfer factor activity so Then we can look at the hydrogen 80 of the infer transfer factor so I did a t-sne on this with this infer transfer factor activity matrix and here is t-sne plot with adult liver single-terrain stick data and different color is different cell types in a lever and we could see that the the cells come from the same cell types are grouped together and And also the next question will be for every single cell types. What are the highly activated? Transfer factors so here. I'm using a random forest model to get that information So the key idea is you labeled yourself With the different transfer different cell types you can this is a binary label So for one particular cell types you just a label one another cell types You buy zero and then you use the feature importance in this random forest to identify Which transfer factor is the most important transfer factor to? to predict or to Classify the the cell types so but also you can use you can try different type of method to do this for example You do AOC based the one or you can do static testing to find the most highly Activated a transfer factor in your dataset, but the final output will be something like this So it will show you like in different cell types So what are the marker transfer factor highly activated transfer factor infer transfer factors from your dataset? So okay, here's a brief summary about BFAM So first one infer transfer factor activity is based on single cell RNA-seq data and the chip-seq data Then we we learned the transfer factor activity profiles accurately identified the cell heterogeneity So the last function I didn't show here but you can also always look at the with the BFAM with matrix to extract the Regulatory strength infer regulatory strengths between transfer factor and its target genes. So yeah We our paper is published last year in genome research and our the code is public available in our G-hub and there I have another comment here as we are BFAM is under under improvement So we have three different direction. The one is like besides the chip-seq data What else can we integrate into the single cell RNA-seq data? So for example, we can have a single cell ATAC-seq data or different type of All mixed data. So I think how to integrate multi all mixed data into single cell RNA-seq analysis That will be one direction. The second one is we are using In this version, we are just to use assigned the W matrix with different with a With a different player knowledge. So that's the way how we integrate chip-seq knowledge into the analysis But is there any other elegant way to do that? I think that that is one direction We are interested in and the third one is The the key the method to do the inference is a automatic differentiation variational inference and we are using our stand to do that, but currently there are multiple other libraries like TensorFlow probability and the Pytorch. I think we are really interested in to Connect the Bayesian hierarchical model with the deep learning package. So that will be a third direction. We are working on Yeah, so feel free to raise a issue if you have any question about BFAM Or you have any other things want to discuss you can this is my Personal email. Yeah, thank you. All right. I know there was at least one online question For trying now So for Ryan Thompson wanted to know does BFAM assume that the chip-seq data is static? If you had chip-seq data for each sample or each experimental condition, could it use that? Um So currently we do we do not support that so the we are using the GTR a database is which is a building in the package But I think you can definitely change your Chipsic target your prior information into these BFAM framework. So you just to change your binary matrix How to Yeah, like here So if you for example, if you have a I don't think you have all the transfer factor chip-seq data, right? So yeah, you only have maybe in your lab only have one or two chip-seq In particular transfer factors. So you can definitely change this binary matrix with your experiment data But I'm glad to help you to do that modification, but currently we don't support it Automatically. Yeah. Yeah, I was a similar question. I guess I'm wondering Does the model only it's only able to infer activity for those Transcription factors that you have chip-seq from GTRD, right? If the chip-seq data if you the chip-seq that is not available We cannot do any inference on this Okay. Yeah. Yes Okay, thank you Sean. Um, yeah, let's move on to the second part. So for the young catcher on the Potential biological application is pretty straightforward So if you have a longitudinal experimental design and you collect your time course RAC data both both like both or are single cell RSC data You want to know which are the most dynamic genes through the time course? and then another question is After you have these gene-wise Dynamic information There are every a very like unbiased way to visualize or extract the dynamic pathways biological pathways through the time course for example biological processes like disease progression or embryonic development Okay, so here I put a figure from our manuscript in JCI inside and basically give me a Overview of the framework or trend catcher. So for the user part You just need to provide it the Count matrix. So for example, if you are using a bulk RSC You just provide us a count matrix each row is your gene and each column is a sample you collected at certain time point and we encourage you to use equal or more than three replicas for each time point If you are doing single cell RSC So we encourage you to generate some pursuit of bulk are a library So for example, you can do the integration and the cell type annotation beforehand and then for each cell type you identified you isolate these cell types and you can either use average expression or gene summation like for each single individual gene and then construct this pursuit of bulk Count table and you can feed these to the trend catcher algorithm. So what you can get is Of course is a master table is gonna tell you Which one which G is genes are actually Significant and dynamic through the time course and a set of visualization functions. So when unique thing is very useful for the biologist So instead of looking at the individual genes, you actually we generate this time he mapped can tell you Which biological pathways how the biological pathway enrichment change or activity or deactivated through the time course? And I would like to spend like one or two minutes to go through the major steps for the algorithm because Later on I want to show Like where like the information you can use from the result So first step is we call it a baseline fluctuation confidence interval construction Basically, you need to tell trend catcher. What is our baseline time and we're gonna fit the gene expression information into a medical binomial distribution. So and then we can construct the Estimated mean gene expression dispersion from for the baseline and for the non baseline time point data We are applying a smooth ANOVA curve fitting orism and then so instead of looking at the replicas now you have a fitted account for each time point for each single individual genes and Using all these estimated gene expression through the time course You can compare for a non baseline gene expression data Compared to the baseline. How far it is actually go away from the baseline expression and in this way we can Do how about this testing to assign a p-value for each time point and then user-fisher P-value combination method we are able to assign a dynamic p-value for each single gene and We also use we call it a breakpoint screening strategy to assign the gene trajectories So for each gene we give a trajectory types type So later on we use this information to construct the visualization functions so this is the overview for trend catcher and So if you look at the gay hot page for today's workshop, there are four articles right there I think it's like Was designed to purpose to answer certain questions, but it will be a lot to present for the time concern So we're gonna run through some most important functions to show you What is input what is output and how do you interpret this output? Okay? Let me go to the orchestral Okay Okay, so after you launch the orchestral You can click the Venice folder and you're gonna see like the our markdown files for four different articles will be there So the first one is the most important one Let me see if it can make the farm size bigger Okay, it's visible for the eye for the audience, okay Okay, let's library the package, okay, I can just run the button here So to Demonstrate how it works. We provided some demo data So if you look at the package folder, you're gonna find it in the this folder the ext data folder So for the demo data It is actually a real biological experimental data It was collected from the brain endothelial cells from the mouse After the LPS the injury so we measured at six different time points like the baseline six hours 24 hours and so on You can load it Just provide where is the file is and then So yeah here so you can see it is a Matrix and the each row is a gene So here is a gene sample ID and each column So for the input data, we have some Requirements for the column data. So the name need to be composed by three parts. So one is the project name You can just put any characters here and the second Is your time point? So it's like number converting to the same unit and The last one is your replicate ID and they must start with the REP and just to put number between it Okay, so the major function and the most important function from truncature is this one. So run truncature All you need to provide is your count table, which we encourage to use a CSV file And then you can tell Truncature, what is your baseline time? So here is zero and your time unit This is not for analysis, but for the further visualization the figure title and so on and we also do gene low-count gene filtering and You can also run it on using multiple cores and The last parameter here is the dynamic p-value threshold. So we strongly suggest you just leave it as 0.05 So these function So I run it on my workstation and use 14 cores. So normally this Just finish like in within two minutes. So it's pretty fast But if you are using like single core or like very few cores, it could take 10 to 15 minutes to finish if you are like around 20,000 genes So so for time limits, I just want to demonstrate what is the output? So the output will we call it a master list and this is the essential to do the further analysis and visualization I also so here will not run this call, but you can run it after the workshop But the output of the mask list I put it in one of the demo data You can just run these code to load it and then I want to show What is the like the the data structure like output? So if you look at the master list right here names mask list You're gonna see they this is like a list contains many elements. So the first four or five Elements are actually the basic information the user provided of course include your countable what I want to Demonstrate is the fitted count data and the master table Okay, so We can just go head just run head Master list and then you extract these fit fitted count it right here Okay So here you can see this is the after the smooth or no occur fitting. So this is After run tranquill this for each gene You're gonna have a fitted count from your replicates and this takes the the the time sequential information inside to occur fitting and the mule and DSP dispersion. This is the baseline Mean expression and dispersion as the meaning by trunk catcher and t dot p dot the value. So this is the The hypothesis testing to for each time point to see how far is actually The expression changed compared to the baseline and of course for each single gene we assign a Combined p-value and then if you have like tens of thousand genes, you do you need to perform the p-value or the adjustment so the last column the p-value dot a DJ is your Statistical testing like fight the the p-value you're gonna use to decide if the gene is Significantly differentially expressed through the time course and the the last element from this list Is master table? So there are many columns, but I want to show you The trajectory pattern we assigned for each gene. So basically it's just a string Tell you for each gene Correspond to each row and for example first gene The gene expressions start to increase until 48 hours and start start to going down until the end of the experiment So this is the most important Analysis for the trunk catcher and after you grab this master list from the output You can just do a lot of a further analysis and visualizations using this master list If you go to the second article the gene trajectory analysis, so one thing is if you are providing the the gene expression metrics using ensemble ID it is impossible to to like to know which gene it is so one thing is we encourage you to add a symbol gene column to your master table Are they still sorry so well that waiting so we provide some helper functions So if you are providing assemble ID, you can just to run this set of a code. It will just to automatically Convert the gene ensemble into Your gene symbol so basically this trunk of a cold what it does is at the end as you can see add a two new columns to your Master cable so now if you plot figures, you know like which gene is actually Corresponding to So and then if you look at the manual trunk after you're going to see a lot of functions for visualization And they all start with a draw so it's draw something so first function is Draw gene trajectories so as you just to put your master list in there and then you put the gene Ensemble names you want you want to plot So the output is you can see your fitted count data for example The black dots are actually your observed data like from many replicates and the red dots is your The fitted the fitted count data from train catcher and then the gray horizontal line actually the baseline fluctuation interval I mentioned before and you can also grab the The p-value for each dynamic p-value for each gene at the trajectory patterns where it belongs and another important question when people Assign your trajectories to gene They want to know The composition like which for example like as most of the genes actually follow a monotonic trend or is follow like By phase trend and they want to know the timing of the activation or deactivation so the function to run it is Draw trajectory cluster grade you just put your master list of there and then Oh well Yeah, it's working. Thank you Okay, so Yeah, one question is the the clustering or the grouping of the gene trajectories So for example, LPS injury will be a very like like injury to your body, so we're expecting some hyper-inflammation genes which will activity in the early time So as you can see like forty four hundred twenty one genes are actually following this by phase trend and then they Tend to activity in the very early time of six hours forty hours twenty four hours So this is one of the visualization function There are a set of visualizations that you can play with later on after this workshop So I will just go to the next time he map, which is a unique feature from train catcher So the to generate time he map right here, so you Just need to Call these draw time he map go function. Okay. Thank you We also provide so belong except the goal enrichment analysis So we also integrate the enrich our so you can do either way So here I give example you can just support your master list there and you tell train catcher Which you're you're running like using human or the mouse and then the goal in a set of other parameters to run goal enrichment So I suggest to only plot like top 10 Go functions right there So it's gonna grab the top 10 up regularly and down regularly genes through the for for the time he map So these function is gonna if you run this chunk of function is going to take several minutes to run But I to show what it's all put I Put the time he the map a list object also in the our package demo data This is a pretty large object Okay But as you can see it contains three elements. So first of course is your figure You can just print it out and You're gonna see so this is unbiased way to actually tell you which biology pathways actually dynamic through the time course and As you see like there are a lot of go terms actually redundant For example, the extra extra seller extra seller structure matrix organization So there are three terms actually that's thing. So to do that We provide another function You can just a select part of that. So just call these let go And It will just to remove you can just a manually remove the redundant term and just the plot this is for your publication So the last function and not the last function for this article It will be very useful for the biologists So we know about pathways and they want to know like which genes are actually They should pay attention to So you can call this draw go heat map function and you tell them which is the time window you're interested in and It will show the Kind of a heat map to show the gene expression Level heat map so this gonna take a few seconds to run we're still going okay So so this for zero to six hours These set of genes coming from your Selected interest the pathways and then there's a corresponding logful change And then you can pick so you can narrow it down to a set of genes like you can use for the further analysis And the last part of the the last article is So for example, you have more than one group and you are doing Unituno experiment setting for two different biological conditions. So here we use the example from our paper so For example, you have a severe COVID-19 patients group and Moderate COVID-19 group and then you have you so you run truncator twice And you have to master list and you also run time he map you have to time he map object And you want to and then you realize one of the biological functions was uniquely showing in One of the analysis and you want to compare Is actually a significant difference between the two temporal Groups and of course if you are using single cell for each cell type You're running truncator and you can compare two different cell types So here we so the way to do it We provide a permutation test Together with the visualization just the call curve compare permutation put your tool master list there and then for example the neutral field activation we see in the severe group, but we wonder is actually it's gonna separate the biological groups and this is a super fast to run and after that what you can get is Lloyd's curve fitting for the two different groups for the neutral field activation genes and The gray area indicates after the permutation test These two groups are started to separate significantly around like the second week after the symptom on site So this is the last article Um So I think if you go through the four big news articles, you can do a pretty thorough analysis for your time course They are a stick data and for the further documentation and like details you can go to our github page And if you are trying using truncator, please cite our article was published in February DCI inside Yeah, and the end I oh sorry at the end I like to Okay, to thank the members of our lab, especially our professor for Julius Rima and then Shangal right here and our collaborators And thank you very much. I like to take questions any questions to the audience. Thanks for the talk and especially I have a question about Catcher a trend catcher Do you compare the two? groups, yeah, if you're in versus mother and then how did you compare I mean you are kind of Conducting a state retesting. So did you conduct a testing at each time point separately? So so for time course you have like for example for the same gene And you have expression trajectory from two different groups, right? Yeah, and the permutation test is to shuffle the label of the gene expression Like and then you do the curve fitting and then what we do is calculate the area difference So in this way, if you run 1000 times, you're gonna have like kind of an empirical distribution of The shuffling data and you can test to the hypothesis testing. So the testing is going to be for each gene one by one Yeah, that's a very good question. Yeah, so so here If you oh It's not here So we do the log curve fitting for example a neutral field activation We find 117 genes and then we do a like a curve fitting for each group using the neutral field genes Activation genes and then you compare those two curves So not like 117 gene trajectory curves. You're doing that But I guess there are more like Like ways to do it, but so far this is the package support Thank you. And the other question is you showed us trajectory clusters So the bunch of genes are falling into this kind of the patterns. Yeah, yeah How did you group them? Oh that that's we didn't apply any machine learning So we assigned the gene trajectory pattern from after we run the trunk catcher, right? And you have just a descriptive term for that gene No Because because trunk catcher algorithm already considered a temporal thing And then yeah, I mean you can just use a clustering for a time for there are several methods But the smooth or Noah actually takes the text secretion information inside So I think the estimation is enough and then we just use the break point to assign the Interjectories and then just group them Yeah Thank you for the talk. I have a question for Okay, yeah, did you try to use this for a taxic data? I assume you could treat it the peaks Yes, right as genes. This is a very good question. So that's actually we would like to extend the package to like a time course a taxic data and we're also Collaborating with some other life to help to generate data and analysis. So yeah, I'm also looking forward to Extend the package for that Yeah, follow up on this question So I assume you could also run both a taxic and only seek together in your classroom, right? And you can see another classroom, but then the time series then you can see which gene which peaks they go together Okay, okay Yeah, you can run separately but the further Interpretation if you want to find like which happens fast like at first could cause a little later event that will be Yes, so the quality we need to think more carefully about that but on yeah, if For a taxic data and RZ data, I think if we extend of our package later You could run both and then the further Interpretation we still need to think carefully about that Okay, thank you happen first before your gene expression, right? So Is there any way to infer the causal relationship between the openness of the chromatin and the expression of the But the genes on a time course manner. I think that will be a very interesting question Right. So I think the question is whether this delays because of association or is there's a causal relationship between that that thing Um Yeah, I think that's a very interesting question. Okay, there's a couple of online questions from West Wilson How does the smoothing spline a nova model of trendcatcher compared to the other popular bio C time course packages like EBC HMM they use hidden markup model or Masec Pro that's a NB plus PR or spline TC spline regression Yeah, so We tried several curve fitting methods out there Like I don't think it's from bioconductor, but some general curve fitting methods So the because for example, it's really depends on the biological question For example, we are more curious about the injury which actually gonna cause the sharp increase or decrease from your gene at a very short time. So So far the smooth on over which is kind of a pretty good trade-off between the smoothness and the the fitness so we realize that for a sharp especially for like a Local change like large local change happening a very short time is actually pretty good performance compared to the other general curve fitting methods Yeah Next question is does trendcatcher require that all subjects to have the same set of common time points Can it handle cases where subject a has samples on day one and three but subject B has samples on day two and four Can you repeat the question does require all subjects to have the same time points or all treatments to have the same time points? Yes, I think so far yes because The requirement for the the input count table is really need to be the same point for the multiple replicates biological replicates Okay, and then also West followed up on that for Ryan's question said that was gonna be his next question That other spline approaches I have tended to need that so I was curious. That's why I used ebc HMM over spline TC in the past Yeah, we for the simulated data set we actually compare with the DC to DC to spline curve fitting We perform slightly better than them but the advantage of using trunk after is for longer time course series when the trajectory pattern was more complicated for example more tomorrow. We actually performs better but for like like short time or limited like small time Number of time points the performance is very similar actually Okay We're out of time. Let's thank you and John again for nice