actually they have many biological functions, so basically they are biologically meaningful features.

And in the consensus clustering we need to choose a clustering method. I think k-means is the most used method. In cola we recommend skmeans, spherical k-means, which is a variant of standard k-means clustering, but it uses cosine similarity instead of Euclidean distance. It essentially projects all the data points onto a sphere, or hypersphere, and applies a k-means-like method on the surface of this hypersphere to look for groups. So it is kind of a correlation-based k-means clustering.

When we have a list of individual clusterings, we can estimate the probability of two samples being in the same subgroup, and the result is represented as a so-called consensus matrix. This matrix is symmetric, its rows and columns are the samples (the columns of the original matrix), and its values are between 0 and 1: the value is the probability that sample i and sample j are in the same subgroup. With this consensus matrix we can evaluate the stability of the classification. Here the left plot shows an unstable classification and the right one shows a stable classification. On the right, the color is either blue or white: white means the two samples are never in the same subgroup, and blue means the two samples are always in the same subgroup. So the right one is a stable clustering and the left one is an unstable clustering.

Normally we do not know how many subgroups there are, so we try a list of different numbers of groups, and then we need some method to decide which k (the number of subgroups) is better. One method is called the PAC, which stands for the proportion of ambiguous clustering. Again the left heatmap is an unstable classification and the right one is a stable classification. If we make a CDF of the consensus values in the two matrices, the red line corresponds to the left one and the blue line corresponds to the right one. For the blue line, the stable classification, the consensus values are either very close to zero or very close to one, so the CDF is very flat in the middle; as a comparison, the red line, the unstable one, shows a step pattern in the middle. The PAC measures the distance in the y direction between two points: we pick one consensus value close to zero, for example 0.1, and another close to one, for example 0.9, and the PAC is the difference of the CDF at these two points. You can see that when the classification is stable the PAC value is very small, and when the classification is unstable the PAC value is very large. So we use 1 - PAC to measure the stability of the consensus classification, and when 1 - PAC is larger than 0.9 we say the classification is stable.

cola also provides many visualizations for interpreting the consensus clustering results. The first one is the consensus heatmap. The second is the so-called membership heatmap, where every row is an individual clustering; you can see that in this example the individual clusterings are very consistent. And this plot shows the different metrics used to decide which k is the best.
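As a concrete illustration, here is a minimal sketch of spherical k-means using the skmeans package, which is also what cola's "skmeans" method builds on; the toy matrix and the choice of k = 3 are made up:

```r
# Minimal sketch of spherical k-means with the 'skmeans' package;
# the toy data and the choice of k = 3 are arbitrary.
library(skmeans)

set.seed(123)
mat <- matrix(rnorm(50 * 20), nrow = 50)  # 50 features x 20 samples

# skmeans() clusters the rows of its input on the unit hypersphere using
# cosine similarity, so we transpose to cluster the samples (columns).
fit <- skmeans(t(mat), k = 3)
fit$cluster  # subgroup label for each of the 20 samples
```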
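The consensus matrix and the PAC can likewise be illustrated with a toy computation; this is my own simplified sketch, not cola's internal code:

```r
# Toy consensus matrix + PAC computation (simplified; not cola's internals).
set.seed(123)
mat <- matrix(rnorm(40 * 20), nrow = 40)       # 40 features x 20 samples
n <- ncol(mat); k <- 3; B <- 50                # B individual clusterings
consensus <- matrix(0, n, n)
for (b in seq_len(B)) {
    rows <- sample(nrow(mat), 0.8 * nrow(mat)) # random subset of features
    cl <- kmeans(t(mat[rows, ]), centers = k)$cluster
    consensus <- consensus + outer(cl, cl, "==")
}
consensus <- consensus / B  # value[i, j] = P(sample i and j in same subgroup)

# PAC: CDF of the consensus values at 0.9 minus the CDF at 0.1;
# 1 - PAC > 0.9 is the rule of thumb for a stable classification.
cdf <- ecdf(consensus[lower.tri(consensus)])
PAC <- cdf(0.9) - cdf(0.1)
1 - PAC
```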
The plot at the bottom left is a dimension reduction plot, like UMAP, t-SNE or PCA, and the bottom middle is the signature heatmap, for example for the signature genes.

On the features of the cola package: it allows you to integrate self-defined methods. If you remember this slide, in every step you can choose a different method, for example for feature selection or for clustering/partitioning. cola lets you use the predefined methods that are already integrated in the package, but you can also use self-defined methods. cola can also run multiple methods simultaneously, and it provides nice visualizations to help you select the best combination of methods. For example, in this left plot there are four feature selection methods and six clustering methods, and for each combination there is a consensus heatmap, so we can very easily see that this one, this one and this one show very stable clusterings. And for a specific method there are also plots to help you decide which k is the best; for example here you can see that k = 2 or k = 3 give the most stable classifications, but when k increases to 4, 5 or 6 the classification becomes unstable, so it is very straightforward to see which result is the best.

We also did some benchmarks: we applied cola to several hundred public datasets, and we found that ATC combined with skmeans normally generates more subgroups, and also stable subgroups. In this plot, more towards the top means the classifications are more stable, and you can see they basically all integrate ATC or skmeans. According to our experience, using ATC with skmeans can generate new subgroups, because this is a correlation-based method, while a lot of other methods are variance-based or use a Euclidean distance. The new subgroups found by ATC with skmeans are different from, for example, known subgroups or subtypes from previous studies, and we also found they are more biologically meaningful: for example, we see from the survival data that they show a more significant difference in survival. cola was published last year, so if you are interested you can check the paper.

There are two major limitations of the standard consensus clustering method. The first is that it cannot separate major subgroups and secondary subgroups. The major subgroups are those that show a big difference, and the secondary subgroups show a smaller difference; if a dataset contains both, normally the small, secondary subgroups cannot be separated. The second limitation is that when there is a huge number of subgroups in the data, the standard consensus clustering method can normally only identify a small number of them; it cannot identify all the subgroups and will merge some of them. I will demonstrate these two limitations with an example.

First I demonstrate the difficulty of simultaneously distinguishing major and secondary subgroups. This is based on randomly generated data. I generated four subgroups: the red and blue ones are the major subgroups, and the green and purple ones are the secondary subgroups, where you can see the differences are very small.
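Coming back to the self-defined methods mentioned above: cola documents register_top_value_methods() and register_partition_methods() for this purpose. The sketch below shows the idea, with method names and bodies of my own invention:

```r
# Hedged sketch of registering self-defined methods in cola; the method
# names ("IQR", "raw_kmeans") and their bodies are illustrative only.
library(cola)

# A feature-selection ("top-value") method: one score per matrix row.
register_top_value_methods(
    IQR = function(mat) apply(mat, 1, IQR)
)

# A partitioning method: takes the matrix and k, returns a label per column.
register_partition_methods(
    raw_kmeans = function(mat, k) kmeans(t(mat), centers = k)$cluster
)

# The registered names can then be used like the built-in ones, e.g.
# consensus_partition(m, top_value_method = "IQR", partition_method = "raw_kmeans")
```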
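Data of that shape can be mimicked with a few lines; this is my own toy construction, not the exact data from the slides:

```r
# Toy data with major vs secondary subgroups: red/blue differ strongly on one
# block of genes, green/purple differ weakly on another block.
set.seed(123)
grp <- rep(c("red", "blue", "green", "purple"), each = 10)
mat <- matrix(rnorm(100 * length(grp)), nrow = 100)
mat[1:50,   grp == "red"]    <- mat[1:50,   grp == "red"]    + 2    # major
mat[1:50,   grp == "blue"]   <- mat[1:50,   grp == "blue"]   - 2    # major
mat[51:100, grp == "green"]  <- mat[51:100, grp == "green"]  + 0.5  # secondary
mat[51:100, grp == "purple"] <- mat[51:100, grp == "purple"] - 0.5  # secondary

pca <- prcomp(t(mat))
# PC1 captures only the strong red/blue split; green and purple only
# separate once PC2 is added, which is exactly the situation discussed next.
plot(pca$x[, 1:2], col = as.integer(factor(grp)), pch = 16)
```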
And then the PCA shows that these four groups can be separated very well, but the green and purple ones cannot be separated on the first principal component; they can only be separated if we take the first two principal components. If we apply standard consensus clustering, the best k, the best number of subgroups, is actually three. We can also confirm this with the membership heatmap: when k is three the clustering is perfect, you can see it agrees across almost all the individual clusterings, but when we set k to four the classification becomes unstable. Only in about 10 percent of all the individual clusterings can these two secondary subgroups be separated; in the remaining 90 percent they cannot, with the result that cola will not choose four groups but only three groups as the final result.

Next, the small-k problem. This is again the HSMM single-cell dataset. Panel A at the top shows the top 2,100 genes that we apply the consensus clustering on. This heatmap shows very clearly that there are many subgroups in the dataset, but if we apply standard consensus clustering, you can see it actually prefers to pick k = 2, or, if we allow some looseness, maybe k = 6 as the best number of subgroups. But obviously there are more than six subgroups, maybe 10, 11 or 12. If we check the membership heatmap and the consensus heatmap, we can see that when k is two, three or four the classifications are quite stable, but when it increases to five, six, seven or eight the classification becomes unstable. This is because when there are more subgroups, the probability of two samples being in different subgroups increases, which decreases the stability of the classification.
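A back-of-the-envelope illustration of this last point (my own, not from the talk): with k equal-sized subgroups, two random samples fall into the same subgroup with probability roughly 1/k, so larger k pushes consensus values away from the clean 0/1 pattern of a stable classification:

```r
# Chance that two random samples share a subgroup, for k equal-sized groups;
# it shrinks as k grows, so consensus matrices become more "ambiguous".
k <- 2:12
round(setNames(1 / k, paste0("k=", k)), 3)
```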
So in cola we propose a second method called hierarchical consensus partitioning. This is a very simple method; the main idea is to apply consensus clustering in a hierarchical way. If we have a cohort, we first apply consensus clustering, but only with a small number of subgroups, let's say three. If we find the three subgroups are stable, then we recursively apply consensus clustering to every subset of samples. This process is applied hierarchically and stops when certain criteria are met, so in the end we have a hierarchy of subgroups.

Again we applied this hierarchical consensus partitioning (HCP) to the HSMM dataset, and now we can detect a lot of subgroups, 12 subgroups, and they are basically well separated in the t-SNE plot. You can also see in figure C that the patterns of the signature genes are very clear, figure D shows these genes are biologically meaningful, and figure E shows that the HCP classification is stable: I repeatedly ran HCP 20 times, and basically the classifications agree very well across these 20 rounds.

We also applied HCP to the PBMC single-cell dataset and compared the HCP classification to the Seurat classification and to the standard cola classification (CP, which stands for consensus partitioning). Basically you can see that HCP can find more subgroups compared to the standard CP, and there is some agreement with the Seurat classification as well as some disagreement. Basically, the red cluster in Seurat was separated into two smaller groups; this is the cola HCP classification and this is the Seurat classification. We actually find that these two subgroups are quite similar in their gene expression, and if we choose a loose cutoff for HCP they can also be separated into two groups, but again the genes in these two subgroups are very similar. Figure E is the heatmap of the signature genes for this classification, and the last figure shows the stability of the HCP classification on the PBMC dataset, which is basically very stable.

Then we also applied HCP to this central nervous system tumor dataset. This is a methylation analysis; the dataset contains more than three thousand samples and fourteen tumor types, in total 91 subtypes, and we wanted to see how well the HCP classification agrees with this 91-subtype classification. We cannot say they agree perfectly, but you can see in figure B and figure D that they agree very well.

There are some comments on the HCP method. Basically, if a dataset is very heterogeneous, applying HCP will generate many, many more subgroups. This is an example from a glioblastoma dataset, which is very heterogeneous, and if we apply HCP to this dataset you will always see there are many, many subgroups, and some subgroups are very small, containing only six or seven samples. If we look at the signature heatmap here, this classification indeed makes sense, but with too many subgroups it is maybe meaningful only for this dataset and loses generality on other datasets focused on the same biological object.
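The recursive scheme behind HCP can be sketched in a few lines of R; this is a schematic toy, not cola's implementation: plain kmeans() stands in for a full consensus partitioning run, and only a minimum node size is used as the stopping rule here:

```r
# Schematic sketch of hierarchical consensus partitioning (not cola's code).
# cola's real stopping rules also include partition stability (1 - PAC) and
# a minimum number of signature genes at the node.
hcp_sketch <- function(mat, node = "0", min_samples = 6) {
    if (ncol(mat) < min_samples)  # stop: node too small to split further
        return(setNames(rep(node, ncol(mat)), colnames(mat)))
    cl <- kmeans(t(mat), centers = 2)$cluster  # stand-in for one consensus run
    labs <- lapply(unique(cl), function(g)
        hcp_sketch(mat[, cl == g, drop = FALSE], paste0(node, g), min_samples))
    unlist(labs)[colnames(mat)]  # hierarchical node labels like "021"
}

set.seed(123)
m <- matrix(rnorm(50 * 40), 50, 40, dimnames = list(NULL, paste0("s", 1:40)))
hcp_sketch(m)
```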
This HCP method is also published, so if you are interested you can check this paper as well.

The last topic is some optimization for large datasets. Consensus clustering repeatedly runs the clustering many, many times, so it is a time-consuming process. For huge datasets we implemented a so-called down-sampling consensus clustering. The idea is very simple: we randomly sample a subset of the samples to perform the consensus clustering, then we look for the signatures, and then we predict the class labels for the unselected samples.

Okay, now I will quickly show you some code. Using cola is very simple: basically you just run these several lines. You call the run_all_consensus_partition_methods() function and provide your matrix, and it performs the consensus partitioning; I think by default it tries four different feature selection methods and five clustering methods. When the analysis is finished, you run cola_report() on the result object and it generates a very detailed report of the whole analysis. These code examples are runnable; the first block is basically the preprocessing of the matrix. The hierarchical consensus partitioning is similar: this block of code just generates a matrix m, and this time we call the hierarchical_partition() function, again providing the matrix, and then we use cola_report() to generate the report.
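Put together, the blocks just described look roughly like this; the random stand-in matrix and the output directory names are made up, while adjust_matrix(), run_all_consensus_partition_methods(), hierarchical_partition() and cola_report() are the cola functions mentioned above:

```r
library(cola)

set.seed(123)
m <- matrix(rnorm(2000 * 40), nrow = 2000)  # stand-in for your expression matrix
m <- adjust_matrix(m)  # preprocessing: impute NAs, drop flat / low-quality rows

# standard consensus partitioning: all combinations of the registered
# feature-selection (top-value) and partitioning methods
rl <- run_all_consensus_partition_methods(m, max_k = 6)
cola_report(rl, output_dir = "cola_report_standard")

# hierarchical consensus partitioning on the same matrix
rh <- hierarchical_partition(m)
cola_report(rh, output_dir = "cola_report_hcp")
```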
Now I will show you this report; this is the example for the standard consensus clustering. It generates a very detailed report with many, many plots. If I quickly go through it: if you just type the result object, it tells you which functions you can use on this object. You can also see the distribution of the data, like a density plot for each sample, and some tables of which methods were used: here we tried four feature selection methods and six clustering methods, so 24 method combinations for the clustering. This shows the best k for each method, together with the different statistics and which classifications are stable. Then there are many plots: consensus heatmaps for different k, membership heatmaps for different k, and an overview of the clusterings from the different methods for different k. We can also go to the results for an individual method; let's just pick this one, ATC:skmeans. There is an integrated plot of the statistics for different k that helps you decide which k is the best, here are the class labels for the classifications, and then the consensus heatmap, the membership heatmap, the signature genes; anyway, there are many, many plots.

And this is the report for the hierarchical consensus clustering. When you type the object, it prints some information such as the hierarchy of the classification and the functions that can be applied to this object. Again there is the density distribution for each sample (this part is not so important), and some values for every node, so we can see whether the classification is stable at each node. Here we can see the hierarchy of the subgroups, and you can also adjust some parameters to get different levels of this classification. Each node is only a subset of samples where the standard consensus clustering is applied, so if you go to the result for an individual node, it is basically a digest of the standard consensus clustering report, but only for the current subset of samples.

The last example I want to show: the previous examples were all based on gene expression data, and this one applies cola to a methylation array dataset. In general the usage is very similar, you just provide your matrix here, but there are some other things to be careful about. Here we need to set scale_rows = FALSE, because this is methylation data and we will not scale the rows. In this example the dataset is the central nervous system tumor dataset, which contains almost three thousand samples, and we set subset = 500, so that the down-sampling consensus clustering will be applied automatically to the subset of samples at every node. I also set a minimum methylation difference, because this is a methylation dataset and we are only interested in the subgroups which show large methylation differences. The other settings are basically the same as in the normal use of the hierarchical partitioning method, and in the end you just call cola_report() and it generates a very comprehensive report for your analysis.
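As a sketch, that call might look like the following; scale_rows = FALSE and subset = 500 are as described above, while the stand-in matrix and the group_diff argument (a minimal between-subgroup methylation difference) and its value are my assumptions, not copied from the actual supplementary code:

```r
library(cola)

# toy stand-in for the beta-value matrix (the real dataset has ~3000 CNS
# tumor samples); replace with your own methylation matrix
meth <- matrix(runif(2000 * 600), nrow = 2000)

rh <- hierarchical_partition(
    meth,
    scale_rows = FALSE,   # methylation data: do not z-score the rows
    subset     = 500,     # down-sampling consensus clustering at each node
    group_diff = 0.25     # assumed cutoff: minimal methylation difference
)
cola_report(rh, output_dir = "cola_report_methylation")
```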
So basically that's all for the cola package. All these examples are runnable; if you check the cola paper you will find the code there, I think in the supplementary files. If you have any questions you can ask me here, or you can also write me an email or ask me on GitHub. Thank you.

Wonderful, thank you. Do we have any questions in the audience? Okay, I have one online from Ryan C. Thompson: how does the hierarchical clustering decide when to stop adding additional levels to the clustering hierarchy? Is it when k-means chooses k = 1 as the best, or some other criterion?

There are several criteria; I can show you here in the report. It tells you the reasons why the hierarchical process stops at a certain node. Here (a) means the consensus clustering is not stable; for example, there is a node with 60 samples where the clustering is not stable, so the process stops there and these 60 samples stay as a leaf of the hierarchy. (b) is a minimum number of samples used as a cutoff, and (c) is a minimum number of signatures on the node. So there are several criteria to decide whether the hierarchical process stops.

That's wonderful. We have a comment from Thomas: the hierarchical extension is great; I used the standard cola clustering in the past and found that a lower k was more stable while I had reason to believe that there were more populations, so this is going to help a lot, thanks.

Yes, but I want to say we need to be careful when using hierarchical consensus partitioning. I would suggest first using the normal consensus clustering, and if you really think you have many, many subgroups, then you can try the hierarchical one.

Right. I was just going to ask: I saw you applied this to single-cell RNA sequencing; do you have any suggestions on the computational requirements, such as parallelization or multi-threading? What were your takes on that?

This package was originally designed for bulk RNA-seq data and methylation data, where we do not have so many samples. Theoretically we can apply it to any kind of matrix, and we can apply it to single-cell RNA-seq with, let's say, 2,000 or 3,000 cells, but I think with more it will be very, very slow; that is also why I implemented the down-sampling method. And of course, parallelization is supported in the package.

That's great. I got a question from Ryan Thompson again: does the parallelization support BiocParallel, or only multicore?

I use the parallel package.

Great, thanks for all the questions. Oh, and I have one in the back; turn the camera on.

Hello. This looks like a very interesting method. Some of my colleagues like to use WGCNA for classifying genes into modules; once you have a module you can find an eigengene and ask whether that eigengene separates samples in some way. Here, I think, you do what you could call co-clustering of the samples and the genes, so you can find genes that explain some clusters, assuming you remove effects from, for example, cell type composition, age and other covariates. Would you say we could compare the genes that you get for each of your modules against WGCNA, or would that be an inappropriate comparison?

Thanks for your question; my colleague also suggested comparing to WGCNA. I don't know, because I haven't thought about it yet, but the feature selection method proposed in cola, ATC, is a correlation-based method, and WGCNA is based on a co-expression network; when it constructs the co-expression network it also calculates, for every gene, the correlation to all the other genes. So I think there might be some similarities, but I haven't looked at that yet.

I have a question: with regards to single-cell data, do you find that the normalization really affects the consensus partitioning clustering?

I don't know, because I have no experience with single-cell RNA-seq data. The examples here are based on public datasets where the authors already provide the code for normalizing the single-cell data, so I don't know how much it affects the consensus clustering.

Okay, because sometimes people normalize with sctransform, people use log transformations, and I was just wondering if you had tried consensus partitioning on different normalization methods, but that's totally okay. Thank you; great talk, by the way.

Thank you. Any more questions in the room? I had one more comment from Ryan Thompson: he would like to recommend looking into supporting BiocParallel, since it would allow parallelization across multiple machines and compute clusters.

Great. Initially I used mclapply, and that's not good because it just forks the whole R process, so then I turned to snow. I will check BiocParallel.

Yeah, I think there's a lot of discussion there; I've always used futures and foreach. Okay, well, I don't see any more questions. Thank you for a wonderful presentation; as always, you have amazing visualizations, much appreciated. A great contribution.

Thank you too, thank you.