All right, we've started. So for the last workshop we have Sehyun Oh from the City University of New York, where she is a research scientist, and she is presenting a workshop on GenomicSuperSignature. So, Sehyun, please take it away.

Thank you for the introduction. Welcome to my package demo session. I know it's hard to be the last one, but thanks for staying. Today I will introduce my new package, GenomicSuperSignature, which is designed for transfer learning and efficient database search. Before I start, I just want to thank Levi and Sean, our awesome mentors and collaborators from the very beginning of this project, CUNY for a friendly and supportive environment, and our funders. Can you see my cursor moving? Maybe not. Oh, good. Okay, so I will start.

This is an overview of today's talk. I will start with a general introduction, go through more detail on what GenomicSuperSignature is and what kinds of analysis you can do, and finish with a quick live demo.

We were motivated to make GenomicSuperSignature because, too often, new gene expression data is analyzed by itself, even though there are large public databases of previous studies. There have been attempts to use these existing databases, but some are limited: they are often hard to use, requiring extensive bioinformatics knowledge; they require heavy computing when you need to train your own model; or they work only on a specific data type, such as immune cells only or single-cell data only, and so on. We aim to improve on these shortcomings of existing methods. So we propose a new method with two main components: first, a pre-trained model named the RAVmodel, which is trained on large, heterogeneous public datasets; and second, our package, named GenomicSuperSignature, for easy application of the model to new data. Our method is robust to batch effects, so it is applicable across platforms and across different underlying biology, which I will show in the use cases later.
Overall, GenomicSuperSignature is designed to interpret gene expression profiles by comparison to published data (for the current version of the model, data archived in SRA) and by connecting them to the relevant literature through MeSH terms and gene sets. Potential applications of our method include: finding studies and datasets similar to your own gene expression data; finding pathways associated with your samples or datasets, in the form of labeling principal components; comparable analysis across datasets from different platforms, such as expression profiles from microarray and RNA-seq; or, using the continuous scores assigned by our model, disease subtyping at higher resolution. Through the transfer learning capacity of our method, we expect that we can identify or infer weak or missing signals as well.

The core values of GenomicSuperSignature are data reuse and interoperability, and this diagram summarizes that concept. We are focusing on making it easy to analyze new data in the context of the existing databases, with minimal computational resources and bioinformatics training. So the exploratory data analysis (EDA) process can become part of a general bioinformatics workflow, part of regular lab work, potentially even wet-lab work. That's our ultimate goal. I will try to describe as much as I can within the time limit, but if you want to check more detail on the algorithm of GenomicSuperSignature, you can find it in our recent publication.

So let me explain a little more about what GenomicSuperSignature is. This diagram shows the whole process of RAVmodel building; RAV stands for Replicable Axis of Variation. The top part is the RAVmodel building process, and the bottom part is its application to new data using the GenomicSuperSignature package. The current version of the RAVmodel is trained on about 45,000 RNA-seq samples from 536 studies obtained from SRA.
We did PCA on the gene expression profiles and clustered the top PCs to build RAVs. The collection of RAVs, named the RAVindex, is further annotated with MeSH terms and gene sets. The RAVindex, the annotations, and the metadata associated with the originating studies are all together provided as the RAVmodel to the user. This RAVmodel is a pre-trained model, and once users bring their own data, GenomicSuperSignature assigns quantitative values to the input data, connects it to the existing databases, and suggests interpretations, shown in this blue box. So it links associated studies, quantified phenotypes such as MeSH terms, and enriched gene sets to your own input data.

As I briefly mentioned, the RAV construction process can be summarized in three steps. First, we collect the top PCs (principal components) of the training datasets and cluster those PCs using hierarchical clustering based on the correlation between their loadings. Then we average the PCs in each cluster, and that average loading is named a RAV. Here is one example: RAV221 is built from PC2 of study ERP01679N8, PC9 of another study, PC3 of a third study, and so on. That is how a RAV is constructed, and the variance explained by these PCs is used later to weight the contribution of each study during RAV annotation; I will explain a little more on that on the next slide. A RAV can be compared to the principal components of a new dataset, which requires a validation process, and I will explain that in a few slides as well.

I mentioned that the RAVindex is annotated; for the current model, it is annotated with both MeSH terms and gene sets. For the MeSH term annotation, we first collect all the MeSH terms assigned to the studies used to build the RAVindex, and each term is inverse-weighted by its frequency.
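The three construction steps above can be illustrated on toy data. This is only a conceptual sketch in base R, with synthetic matrices and an arbitrary cluster count; it is not the package's actual training pipeline, which operates on hundreds of SRA studies with its own clustering parameters:

```r
## Conceptual sketch of RAV construction on toy data (not the real pipeline).
set.seed(1)

## Three toy "studies", each a genes-by-samples matrix (50 genes, 20 samples)
studies <- lapply(1:3, function(i) matrix(rnorm(50 * 20), nrow = 50))

## Step 1: collect the loadings of the top PCs from each study
topPCs <- do.call(cbind, lapply(studies, function(x) {
  prcomp(t(x))$rotation[, 1:5]        # top-5 PC loadings, genes x 5
}))

## Step 2: cluster the PCs hierarchically by correlation distance
d  <- as.dist(1 - abs(cor(topPCs)))
cl <- cutree(hclust(d, method = "ward.D2"), k = 6)   # k chosen arbitrarily here

## Step 3: average the loadings within each cluster -> one RAV per cluster
RAVindex <- sapply(split(seq_along(cl), cl), function(idx) {
  rowMeans(topPCs[, idx, drop = FALSE])
})
dim(RAVindex)   # 50 genes x 6 RAVs
```

The resulting genes-by-RAVs matrix plays the role of the RAVindex described in the talk: each column is an averaged loading vector that can be correlated against the PCs of a new dataset.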
That means if a certain MeSH term shows up more frequently in the training datasets, it is penalized; the reason we did this is that we want to find terms specific to a given RAV. The terms are then further weighted by the variance explained by the contributing PCs, as I mentioned on the previous slide. Finally, there is an additional modification available: certain MeSH terms can be filtered out with a customizable drop list. At the end, each RAV is annotated with MeSH terms specific to that RAV. For the gene set annotation, we use the pre-ranked gene list from each RAV, based on the average loadings, and the pathways passing a minimum q-value are stored in the RAVmodel, ranked by normalized enrichment score.

These two are the main annotations of RAVs. So the RAVmodel is a collection of the RAVindex, the annotations (MeSH terms and gene sets), and the metadata associated with the training datasets themselves. The RAVmodel inherits from the SummarizedExperiment class: the RAVindex, which is a genes-by-RAVs matrix, is stored in the assay slot, and information on the training datasets is stored in the colData and trainingData slots. This is what the RAVmodel looks like. For example, the current RAVmodel has almost 14,000 genes and 4,764 RAVs in it, plus metadata associated with the model building process: which version it is, how many clusters, each cluster's size, the gene sets used for annotation, and so on. The RAVindex is in the assay slot, the colData has the enriched gene sets, and the training dataset information is stored in the trainingData slot.
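The object structure just described can be inspected directly in R. This is a hedged sketch assuming the package's `getModel`, `RAVindex`, and `trainingData` accessors plus the standard SummarizedExperiment API; exact accessor names may vary across package versions:

```r
## Hedged sketch: inspecting the RAVmodel, a SummarizedExperiment extension.
library(GenomicSuperSignature)

## Pre-trained model with MSigDB C2 gene set annotation (~500 MB download)
RAVmodel <- getModel("C2", load = TRUE)

dim(RAVmodel)                  # genes x RAVs (roughly 14,000 x 4,764)
RAVindex(RAVmodel)[1:3, 1:3]   # the genes-by-RAVs loading matrix (assay slot)
colData(RAVmodel)              # per-RAV annotations, e.g. enriched gene sets
metadata(RAVmodel)             # model-building info: version, cluster sizes, ...
trainingData(RAVmodel)         # metadata of the originating SRA studies
```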
So that is the RAV. This diagram explains how the validation process works. The input from users, colored here in gray, is a genes-by-samples matrix. GenomicSuperSignature calculates the Pearson correlation between the top eight PCs of the input data and the RAVs, and we name this correlation coefficient the validation score. The validation score represents the relevance between a RAV and the new data. All the processes colored in green here, from the PCA of the input data to the validation, are done by the package, so users just need to plug in their genes-by-samples matrix and the package will handle the rest. The package further provides functionality to extract the information associated with the validated RAVs, such as literature, raw data, gene sets, MeSH terms, and so on.

Okay, here I have summarized the key terms: RAV, RAVindex, RAVmodel, and the validation score, which I have explained. The sample score, which I didn't go through in much detail, is the matrix multiplication result between the input data and the RAVindex. So validation compares the input data's top PCs against the RAVs, while the sample score compares the input data itself against the RAVs. I will show how this sample score can be used for transfer learning in a benchmark example later.

I am making pretty good time. This general background should be enough to understand the next section: example use cases and a couple of benchmarking examples. The benchmarking examples will be on disease subtyping and transfer learning through RAVs. This first example applies the RAVmodel to TCGA datasets for database search. The first step is to find the RAV relevant to your study or dataset, and the most straightforward way is using the validate function. Panels A and B here both come from the validate function; these are validation results
displayed in a heatmap style, which is part of the package functionality as well, and each column represents a specific RAV. In Panel A we applied the RAVmodel to five different cancer types, and you can see that RAV221 shows a strong and very specific validation score for breast cancer. So I ran this validation on the breast cancer data alone again, which is Panel B: the validation result for breast cancer only, where the top panel shows the average silhouette width of the RAV. That can be used as a guide to identify the right RAV for your dataset, but I will not focus on it much today; there is deeper usage instruction elsewhere. So from the validation process, it seems RAV221 is the right RAV to analyze the TCGA breast cancer dataset.

Once you identify the RAV of your interest, you can instantly print out the associated MeSH terms as a word cloud, shown here in Panel C, the relevant studies in Panel D, and the enriched pathways in Panel E. Here we see that the MeSH word cloud shows "breast" as one of the most representative terms, the three relevant studies are all on breast cancer, and the top enriched pathways are all breast cancer associated. I will reproduce this figure in the live demo and show how quickly we can do this.

The other use case is interpreting PCA results. For this example I used the sorted immune cell dataset E-MTAB-2452. GenomicSuperSignature extracts the enriched gene sets of the RAV that shows the highest validation score for each principal component of the input data; we currently require at least eight samples for this validation process. The annotatePC function identifies the RAV most closely linked to each of the top eight PCs. For example, here PC1 is linked to RAV23, PC2 to RAV1552, and PC3 to RAV1387, and then it extracts the
enriched pathways linked to those RAVs. We can extract the enriched pathways for one PC, or for multiple PCs at once. And I think one of the cool parts is that you can draw a PCA plot along with the enriched pathways in a table. I will show how to reproduce this figure in the live demo session as well.

RAVs represent biological features, so we can use GenomicSuperSignature beyond database search. In the next two slides I will show benchmarking examples on disease subtyping and transfer learning, both taking advantage of RAVs representing specific biological features. In this first example, we benchmarked the RAVmodel against a disease-specific model, applying both models to the same test datasets, the same datasets used for the disease-specific model. In other words, the disease-specific model was trained on colon cancer datasets and tested on colon cancer datasets, while the RAVmodel was trained on heterogeneous datasets and tested on the same colon cancer datasets. The other thing to remember is that the training datasets come from different platforms: the disease-specific model uses microarray data, and the majority of the test datasets are microarray data, whereas GenomicSuperSignature's training data is from RNA-seq, and only one out of ten test datasets was RNA-seq. We compared their performance on separating discrete subtypes, labeled here with different colors, and in the results you can see that even though the training datasets were completely different, both platform-wise and underlying-biology-wise, our model shows comparable performance to the disease-specific model. And because the RAVmodel was trained on heterogeneous datasets, we expect that it
can be used for subtyping other diseases, meaning there are RAVs that represent the biological features needed to subtype other diseases, not just colon cancer. So we think the RAVmodel and GenomicSuperSignature can be more versatile.

The second benchmarking example is transfer learning. These two scatter plots are from two independent datasets. First, using the available metadata, which is the neutrophil count of dataset one, we identified RAV1551 as the one explaining this metadata, the neutrophil count feature. We then applied this RAV to dataset two, which is completely unrelated to dataset one, to infer its neutrophil count. Because there is no neutrophil metadata for dataset two, we used other tools to estimate its neutrophil count and compared that estimate to the RAV-assigned score. The score plotted on the x-axis here is the sample score I mentioned in the key terms table. We confirmed that RAV1551 can explain the same phenotype, the neutrophil count, in dataset two. This means one RAV can explain a specific phenotype across two distinct datasets, and this use case suggests that GenomicSuperSignature can be used to infer missing data or to identify weak and underrepresented biological signals using the existing studies.

Okay, I am going very fast. In conclusion, GenomicSuperSignature demonstrates very efficient and coherent database search, robustness to batch effects, and transfer learning capacity. We think the major improvements relative to existing approaches are: first, greatly increased usability through the pre-computed model and the package; second, versatility, by not being limited to any specific biology and being applicable across different platforms; third, modularity, by which I mean that annotation is separated from
model building; and fourth, scalability, by which I mean a fast training procedure. Those last two are not directly visible to users, but together they make it very easy to expand the RAVmodel, so they can benefit users in the end by providing a larger collection of models. That is linked to our future directions, because one of our main future plans is expanding the RAVmodel to different training datasets, such as single-cell data, different species, and microbiome data, and potentially developing a tool to crosstalk between RAVmodels. My dream is that you could plug in, for example, a mouse dataset and learn from relevant human studies, cell line studies, and so on, without much difficulty, without learning a lot of bioinformatics tools, and without huge computing resources on hand. We also plan to add more annotations, like different gene sets and additional metadata from the originating studies; that part is related to the modularity of the model building process, so we can expand it along different dimensions. Lastly, more information is available from our paper, the package, and the use-case site listed here. I put all the information and links on the workshop website, so you can access them there.

Now I will do the quick demo. For input preparation, there are a few simple requirements. First, gene expression profiles from both microarray and RNA-seq can be used as input, which should be clear from our use-case examples earlier. The input can be provided as an object containing a gene-by-sample matrix, including ExpressionSet or SummarizedExperiment, or just a simple matrix. This tool performs best when the input follows a normal distribution, so that is a little bit of preprocessing we ask for. And the genes should be denoted by gene
symbols. As I mentioned at the beginning, for dataset-level validation we ask that your input have at least eight samples. The vignette is already run, so you can check it, and I will go to the browser. Oh, I think I did something wrong; I will share again. I was looking at the other screen. Can you see my screen right now? Okay.

So this vignette: you go to the workshop info page, then to my workshop section, and the source part has the links: the paper, the Bioconductor package site, and the use-cases site. That page includes everything, including the code to reproduce the figures in the paper, the workshop vignette that I'm going to run right now, and the workshop slides as well. Let's go through this. I just plugged a new dimensionality-reduction function into the package, so the tool can now handle single-cell data; if you want to try single cell, I recommend using the GitHub version of the package. (Can you please make it full screen? Oh, sorry about that. Okay, great, thank you.)

Right now there are two RAVmodels available. They are stored in a Google Cloud bucket, and you can download them directly using wget, but I also put a getModel function inside GenomicSuperSignature. This is not a requester-pays bucket, so there is no charge; you can just run it. I checked how long it takes: usually a few seconds. Each model is a little less than 500 megabytes, but it downloads very quickly. So your model is there now, the RAVmodel, and the RAVmodel here is annotated with gene sets. Let's see which gene sets: this RAVmodel is annotated with three priors from the PLIER package, which are mostly blood-related gene sets, and the RAVmodel C2 uses MSigDB C2 version 7.1.
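The demo steps that follow can be sketched as one short script. This is a hedged sketch assuming the GenomicSuperSignature API used in this demo (`getModel`, `validate`, `heatmapTable`, `drawWordcloud`, `subsetEnrichedPathways`, `findStudiesInCluster`, `calculateScore`) together with the bcellViper example data; argument details may differ across package versions, and RAV index 1139 is the example value identified interactively in this talk:

```r
## Hedged sketch of the live-demo workflow.
library(GenomicSuperSignature)
library(bcellViper)
data(bcellViper)                     # provides `dset`, a B-cell ExpressionSet

## Download a pre-trained RAVmodel (~500 MB, non-requester-pays bucket)
RAVmodel <- getModel("C2", load = TRUE)

## Validation: Pearson correlation between the input's top PCs and each RAV
val_all <- validate(dset, RAVmodel)
heatmapTable(val_all, RAVmodel)      # heatmap view of the top validated RAVs

## Explore one RAV of interest (1139 is the index picked in this demo)
ind <- 1139
drawWordcloud(RAVmodel, ind)                            # MeSH-term word cloud
subsetEnrichedPathways(RAVmodel, ind, n = 10)           # top enriched pathways
findStudiesInCluster(RAVmodel, ind, studyTitle = TRUE)  # contributing studies

## Sample score: the input projected onto the RAVindex (used for transfer learning)
sampleScore <- calculateScore(dset, RAVmodel)
```

The same `validate`/`heatmapTable` pair is what reproduces the TCGA figure later in the demo; only the input dataset changes.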
That is how these models are annotated; apart from the gene set annotation part, the two models are identical. I decided to use the bcellViper data for the database search example. So I load this sample data, and then this is the validate function. There are other options you can adjust, but in most cases you don't really need to worry about them. With validate, you put in your input dataset and the RAVmodel, and it takes less than a second. Then we have the heatmapTable function, which visualizes the validation result in a heatmap format.

This is the process of identifying the relevant RAV. There is another function that collects just the indices of the top validated RAVs, and I have already run it. At the beginning, you need to put in a little bit of effort to find the RAV that best represents your study. How I did it: I checked the MeSH term word clouds, and I think the fifth validated index, which is RAV1139, is the one representing the input dataset best, based on terms like lymphocytes, T cell, and CD28 antigen; those suggest this RAV is the most relevant to the input dataset. Once you identify the target RAV, you can subset enriched pathways using the subsetEnrichedPathways function, passing the RAVmodel and the index you want to check, and you can find the studies relevant to this RAV with findStudiesInCluster, again passing the RAVmodel and the index. If you don't ask for the study title, by default it just gives the study name, which PC comes from that
study, and the variance explained by that PC; but if you ask for the study title, it shows which study is actually involved. You can see here that it explains the input data very well. I think what makes this different from other database searches is that the way you extract this information is purely through your dataset: the expression data itself is used directly. I think this minimizes cherry-picking, or putting your own bias into the search process, so in that way it's beneficial.

There are other accessory functions. You can call getRAVInfo with the RAVmodel and the index you want to check, and it gives information on the RAV itself; or getStudyInfo with the RAVmodel and whatever study you want to check, and it shows the title of that study, the number of samples in the study, and the RAVs to which its top PCs contributed.

Next, I want to reproduce some of the figures in the slides. Slide 17 was the TCGA dataset analysis for quick database search, and I will reproduce that here. One thing I will note quickly: I put this data file in a Google Cloud bucket again, to make this vignette self-standing. I used the AnVIL package, which I think Martin mentioned briefly (and I can't remember who else this time). The AnVIL package has functionality that lets you talk to a Google Cloud bucket directly in the R session, without using gsutil in the terminal. I used that to load this dataset, which is built from TCGA data; the code to build it is also available on my use-cases page. So I load this dataset, which is a list object with five
different TCGA datasets. For now I will just try to reproduce Panel B: I run validate again, then heatmapTable, and it gives the figure I showed you earlier, where RAV221 has one of the highest validation scores. Then drawWordcloud with the RAVmodel and 221 gives you the word cloud, and findStudiesInCluster...

(Someone is saying something, but it's really broken up and I can't always understand. Can you type it, so I can read it? Okay: "Do you have a model for microarray data itself?" No, I do not have a microarray model yet, but that's definitely one thing that can come quickly. Please send a message about what kind of model you want; I want to make something that is useful to other people. Thanks for the question. Our next one is a microarray RAVmodel.)

So, back to the demo. I can find the relevant studies, and I can find the enriched pathways; with this I just reproduced my paper's Figure 2. The next one I want to show you is slide 18, which is the PCA plot with the enriched pathways in it. Again, I made this dataset available in the Google bucket, and I use gsutil_pipe, the AnVIL function, to load it into my session. I load the data and fix the row names. This dataset has 49 samples; the rows are genes, denoted by gene symbol, and the columns are the samples. Again I run the validation. (I'm not sure what this message is, but I think it's letting me proceed, so I'll just ignore it for now.) Then annotatePC for PC2, and annotatePC for PCs 1 through 3. That's it. For the example, I colored by cell type, which is just one of the available metadata fields for this dataset, and I decided to add that. You can actually make the plot without it, which will show
that it's not a required input; you would just get this plot. But if you have the cell type annotation, you can add it here. Here, PC2 is monocyte, and PC3 seems to be something like translation, ribosome, and so on. PC2 seems more straightforward: the monocytes are separating. Our tool is not designed for hypothesis testing, but it is definitely a hypothesis-building and EDA tool, so I think it puts you in a better spot for the follow-up study. And yep, I think this is all the examples I have for today. I hope you enjoyed it, and I will take any questions. Thank you.

Let's see, okay, a question from the audience: can microarray data be used as input? Yes, microarray data can be used as input. Actually, in the colon cancer subtyping example, the test datasets were microarray data.

"I actually have one. Oftentimes, when you work with data from multiple donors, the most variation comes from the inter-individual variation between your donors, and this can dominate the first two principal components in your data, while your experimental factor, your factor of interest, can be quite suppressed. Can your method deal with this situation, filter out that high variation that is not so interesting for you, and focus on the variation of interest?" Let me make sure I understand: you're talking about something like individual differences, asking whether that noise will interfere, or whether I can remove that kind of noise from the analysis? "Well, I wouldn't call it noise, but yes." Yes, I think so. For that kind of case, I would state it as: your samples are different patients, patients with the same disease... "Okay, let's say you have a dataset with 10 donors, five diseased and five healthy, but your main source of variation is the
inter-individual differences and not the disease; let's say the disease variation is in PC5 or PC9." I haven't tested that kind of sample, but my expectation is that once you do, you will have top-ranked RAVs that are scored differently: you will have a signature, even if it's weaker, for something like diseased individuals versus healthy individuals, and then potentially individual-specific RAVs will show up for the individual datasets. I think both will appear, but how they rank in your result can vary depending on how strong the individual differences are versus the common disease-related features. So in some sense, I think those signals will be dissected and separated using our method, by validating against different RAVs. That's what I expect, but I haven't tested it. "Okay, thank you."

More online questions? "There was one. I'm just coming up to this; can you hear me? A quick question on the annotation of the RAVs. You've done the PCAs on all the different datasets, and then you annotate the RAVs, and you're doing it by pre-ranked GSEA or by MeSH terms. Was there much value from the MeSH terms, or was it primarily from the GSEA? When you were actually looking at those terms, which were most useful?" So you are asking when I would use MeSH terms over... "No, no. So for each principal component, you're clustering the principal components based on Pearson correlation, from what I understand, but let me know if I've gotten this wrong, and then you're annotating those components, or the centroid of those components, based on gene set enrichment analysis on MeSH terms?" No, no: the MeSH term annotation and
the gene set enrichment are completely separate annotation modules. "Oh, so you either annotate with one or the other, but not both?" No, a given RAV is annotated by both MeSH terms and gene set enrichment analysis; one single RAV is annotated by both of them. "Okay, a following question. From the stuff we've done in the past, I found that the MeSH terms were very crude and often added very little additional value. So, since your approach is quite different from what we've done, I'm just wondering whether you were able to get value from the MeSH terms; whether they added value or not is literally what I'm wondering." So for me, because I did a bunch of weighting, my MeSH term collection is not from a single study but from a collection of studies, and the terms are also weighted by the different studies' contributions to the given RAV. So it gave me a quicker, if a little cruder, glimpse than the gene sets of what a specific RAV is representing. In that sense, I would use the MeSH terms more at the beginning, to search for your RAV of interest. That's what I would say.

All right, we are out of time. Thank you, and see you again.