So, we have Timo Erkkilä, and we'll be hearing about the RF-ACE method for uncovering nonlinear associations from heterogeneous cancer data. Thank you, Timo.

Thank you. Can you hear me okay? I want to thank the organizers for holding such a great symposium. There have been some fabulous talks, and I also want to thank the organizers for the opportunity to give this presentation; I'm already looking forward to the next symposium. As you all know, TCGA is about collecting a large body of data from different sources with which to interrogate the molecular mechanisms driving the progression of cancer: gene expression data, microRNA data, methylation data, copy number data, protein data, mutation information, and clinical reports. But not only that, also outputs from analysis groups, for example the PARADIGM activities produced by UCSC, and various algorithm outputs built inside Firehose that you can take as yet more inputs to consider as features with which to explain the progression of cancer. So my interest has been to focus on association studies with this whole heterogeneous collection of data that has been given to us. But before one can even think of doing that, there needs to be some mechanism, some automated way, to integrate all this data into a feature matrix from which it is then easier to automatically uncover associations. As an example, per cancer type, if one collects all the available data, there are usually somewhere from 100 to close to 1,000 samples and tens of thousands of features, where each feature can be a gene expression profile, a microRNA profile, a copy number variation profile, et cetera.
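The feature-matrix integration step described here can be sketched with pandas. This is an illustrative mock-up, not the speaker's pipeline: the sample IDs, feature names, and the `N:`/`C:` type-prefix naming are invented stand-ins for the kind of annotated feature matrix the talk refers to.

```python
import numpy as np
import pandas as pd

# Hypothetical sketch: merging heterogeneous TCGA-style data sources into one
# feature matrix (samples x features). Names and values are invented.
rng = np.random.default_rng(0)
samples = [f"TCGA-{i:02d}" for i in range(6)]

# Numeric gene expression profiles.
expression = pd.DataFrame(
    rng.normal(size=(6, 2)),
    index=samples, columns=["N:GEXP:PRAC", "N:GEXP:HOXB13"])

# Categorical clinical feature, with a missing value for one sample.
clinical = pd.DataFrame(
    {"C:CLIN:anatomic_organ_subdivision": ["colon", "rectum", "colon",
                                           None, "rectum", "colon"]},
    index=samples)

# Numeric methylation feature, also with a missing value.
methylation = pd.DataFrame(
    {"N:METH:PRAC_promoter": [0.1, 0.8, np.nan, 0.5, 0.9, 0.2]},
    index=samples)

# An outer join keeps every sample even when a source lacks data for it,
# leaving NaN where a measurement is absent.
feature_matrix = expression.join([clinical, methylation], how="outer")
print(feature_matrix.shape)  # (6, 4)
```

The point of the outer join is exactly the situation the talk describes: not every sample is represented in every data source, so missing values are carried into the matrix rather than dropped.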
And all the data is heterogeneous in the sense that per sample there may be categorical variables, numerical variables, binary variables, string literals (as in clinical reports, for example), and also missing values, so not all samples have data from all sources. So there is an obvious problem: we need an algorithm with which to uncover associations given this heterogeneous collection of data. And to my mind, I don't know of any algorithm except random forest that can natively handle that kind of heterogeneous data set. There are some apparent pros to the random forest algorithm besides supporting mixed-type data and missing values: you don't need to do much data transformation, and it also supports uncovering multivariate and nonlinear associations. So it's not just pairwise association analysis; one can imagine the data suggesting that there are multivariate relationships within it, and it would be nice to uncover those as well. And then some cons. The original random forest algorithm may not be that good for our problems because there is no way to, for example, assess the statistical significance of the associations one gets; rather, a mere ranking of associations is provided via the importance score. Moreover, prediction on new data given these associations, building predictors, for example predicting what the survival of a patient would be, is not that easy to handle with the original random forest algorithm, and the current implementations also more or less lack flexibility. So with that, we thought of implementing an improved random forest algorithm, based on already established variants of random forest, that takes these shortcomings into account.
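The importance-score ranking mentioned here can be illustrated with scikit-learn's random forest, which is a stand-in for the implementation in the talk (and, unlike it, needs categoricals encoded and missing values imputed beforehand, so this synthetic example is fully numeric):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative sketch, not the speaker's implementation: rank candidate
# features against a target by random-forest importance score.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
# Target depends nonlinearly on features 0 and 1; features 2 and 3 are noise.
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# argsort in descending order gives the ranking the talk describes:
# a relative ordering of features, but no significance cutoff.
ranking = np.argsort(forest.feature_importances_)[::-1]
print(ranking[:2])  # the two informative features rank highest
```

Note that the nonlinear terms (`sin`, squaring) are still picked up, which is the advantage over pairwise linear association tests; but, as the talk says, the scores only rank features and say nothing about which of them are statistically meaningful.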
For example, an implementation with added flexibility, so that one can feed in string literals and give data in various formats; for example, the annotated feature matrix format is really handy for our analyses, so support for that type of data is really important. And in order to construct large association maps, meaning that one wants to see how all these features are related to each other, there needs to be a way to put all the individual association analysis runs together, so there needs to be a metric that is comparable across analysis runs. Statistical significance assessment is also important, so that you can provide a cutoff for which associations are meaningful and which are not. And to deal with prediction on new data, that is also taken into account in the RF-ACE algorithm: we have implemented a gradient boosting tree algorithm for prediction. So consider an example with the colorectal data: the gene expression of PRAC, the prostate cancer susceptibility candidate known to be expressed in colorectal tissue, and see what comes out when RF-ACE is run targeting that particular gene expression profile. Here is just a snapshot of the algorithm being run; it's a command-line tool, and you can see as output here, if the font is not too small, some information regarding the data dimensions, which feature has been selected as the target, the feature header for PRAC, and how much data is missing: approximately 50% of the data was missing, but that doesn't matter, because the algorithm can handle missing values. The algorithm then automatically selects suitable parameter values to be used in this particular run.
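One common way to turn importance scores into a significance cutoff is a permutation test against the target; RF-ACE's actual test differs in its details, so treat this as a hedged sketch of the general idea rather than the method from the talk:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Sketch: attach empirical p-values to importance scores by comparing each
# feature's observed importance against a null distribution obtained by
# permuting the target. Synthetic data; only feature 0 carries signal.
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] + 0.2 * rng.normal(size=n)

def importances(X, y, seed):
    rf = RandomForestRegressor(n_estimators=100, random_state=seed)
    return rf.fit(X, y).feature_importances_

observed = importances(X, y, seed=0)
# Null distribution: importances computed after shuffling the target,
# which destroys any real feature-target association.
null = np.array([importances(X, rng.permutation(y), seed=k)
                 for k in range(20)])

# Empirical p-value per feature: fraction of permuted runs that match or
# beat the observed score (with the usual +1 correction).
pvals = (1 + (null >= observed).sum(axis=0)) / (1 + len(null))
significant = np.where(pvals < 0.1)[0]
print(significant)  # expect only feature 0
```

With a cutoff like this, each run reports a comparable quantity (a p-value) rather than a run-specific importance score, which is what makes it possible to stitch many per-target runs into one association map.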
So if you don't have any experience with decision-tree-based analysis, you can simply feed in the data matrix, provide a target, and run the analysis; but if you have more experience with the algorithm, you can manually fine-tune it. In this particular run it executed in less than 200 seconds and found 19 candidate associations that exceed the significance threshold. If you look at the associations that come out in this particular example, out of the 19, the top three were HOXB13, which is a gene proximal to PRAC; anatomic organ subdivision, which basically indicates that PRAC expression varies as a function of location between colon and rectum; and PRAC's promoter methylation, which is related to the expression, meaning that methylation is basically silencing PRAC as it goes up. And as I mentioned, now that there are these significant associations, you can also build a predictor based on them and see, for example, how good these features are at explaining PRAC expression. Going one step further, this was actually computed as well, so you can take that as an alternative output: what the prediction accuracy is, given that you have found these features significantly associated with PRAC. You build a predictor with gradient boosting trees and see how well the data points with non-missing values were predicted. On the x-axis you see the measured data and on the y-axis the predicted data, and the gradient boosting tree algorithm inherently applies bootstrapping so as to avoid overfitting. As I mentioned, about 50% of the data were missing, so one can then see what the predictions would be for those missing data points, and this plot essentially shows those predictions. They are in the same range, but I haven't applied any further analysis to assess the relevance of those predictions.
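The prediction step described here, fitting gradient-boosted trees on the samples where the target was measured and then predicting it for the samples where it is missing, can be sketched with scikit-learn's implementation (a stand-in for the one in the talk; `subsample=0.8` mimics the bootstrapping-style subsampling mentioned):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Sketch with synthetic data, assuming the significant features have
# already been selected: predict a target from its associated features.
rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 3))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)

missing = rng.random(n) < 0.5          # ~50% of target values missing,
                                       # as in the PRAC example
gbt = GradientBoostingRegressor(subsample=0.8, random_state=0)
gbt.fit(X[~missing], y[~missing])      # train only on measured samples

r2 = gbt.score(X[~missing], y[~missing])  # fit quality on measured data
imputed = gbt.predict(X[missing])         # predictions for missing values
print(imputed.shape)
```

The measured-versus-predicted scatter plot in the talk corresponds to plotting `y[~missing]` against `gbt.predict(X[~missing])`; the second plot corresponds to `imputed`, whose relevance, as the speaker notes, would need further validation.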
So that was just one example of how you can apply RF-ACE; let's repeat the analysis for tumor stage. There is one clinical feature, tumor stage, and you can rerun the analysis in less than two minutes or so. You can see that you get lymph node spread and number of lymph nodes, and, for example, among the PARADIGM activities, which we used as candidate inputs for our association studies, the PARADIGM activity of ERCC4 was found to be a significant feature, along with some other features as well. And you can directly see (on the x-axis you see the tumor stage, varying from one to nine) that the lymph node features aren't very powerful in explaining tumor stage at the lower tumor stage levels; it's kind of flat there, whereas the PARADIGM activity has some variation in the lower range. So you would assume that when you combine these different features so as to better predict tumor stage, they would complement each other, and that seems to be what is happening in this case: on the x-axis again you have the measured tumor stages and on the y-axis the predicted ones. Again, about 50% of the data was missing, so the leftover data were predicted here on the left. And we have taken a step further. Now that there is an algorithm that can handle the TCGA data, the heterogeneous data sets with missing values, we have developed a pipeline (most of the credit goes to the Shmulevich lab; there are many people involved in developing it) where we, in automated fashion, generate these feature matrices, analyze them, make complete association maps with RF-ACE, store all the associations for all the cancer types that we have analyzed in a queryable database, complement the data-driven association information with literature, pathways, and that kind of thing, and project everything onto a web service that can be used for exploring the massive amount of association information that has been generated with our approach.
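The queryable association store at the end of that pipeline might look something like the following; the schema, table name, and p-values here are entirely invented for illustration (the three features are the top associations named earlier in the talk), not the pipeline's actual database:

```python
import sqlite3

# Hypothetical sketch of a queryable association store. Schema and values
# are invented; only the general idea (per-cancer-type associations with
# p-values, queryable by target) comes from the talk.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE associations (
    cancer_type TEXT, target TEXT, feature TEXT, pvalue REAL)""")
db.executemany(
    "INSERT INTO associations VALUES (?, ?, ?, ?)",
    [("COAD", "N:GEXP:PRAC", "N:GEXP:HOXB13", 1e-12),
     ("COAD", "N:GEXP:PRAC", "C:CLIN:anatomic_organ_subdivision", 1e-8),
     ("COAD", "N:GEXP:PRAC", "N:METH:PRAC_promoter", 1e-6)])

# Example query: significant associations for one target, strongest first.
rows = db.execute(
    "SELECT feature, pvalue FROM associations "
    "WHERE target = 'N:GEXP:PRAC' ORDER BY pvalue").fetchall()
print(rows[0][0])
```

Storing a comparable metric (the p-value) per association is what lets a web front end sort and filter associations across many separate analysis runs and cancer types.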
So, as a summary, the core of my presentation was to introduce you to RF-ACE, which essentially combines the good parts of various established algorithms: the random forest algorithm, gradient boosting trees, and most importantly ACE, which was published a few years ago. There are not that many new parts that we introduce; rather, we put them into an easily usable package. It's a generic and fast implementation of decision-tree-based learning, and it seems to suit the TCGA data pretty well. As for novel aspects, there are p-values assigned to associations, and you can do pretty accurate prediction with gradient boosting trees. This is an open-source project, and you can look that web page up. It's also work in progress, so there are some things that need to be improved and so on, but I think we're pretty close to finishing the implementation of the algorithm. Many thanks go to the Shmulevich lab, as most of the work has been done by its members, but also to some people from Tampere University of Technology. So, thank you for your attention. Questions?

Thank you, great talk. One question about all these associations: we know that when you take the genomic data and put it in as input to your algorithm, there are a lot of correlations, really a lot of redundancy, in that genomic data, so in the output, the associations, there's a lot of redundant information. What approaches are you taking to condense that redundant information and give a clearer answer to the questions you ask of the data?

That's an excellent question. On the road map for developing this algorithm, we are going to implement methods that take those redundancies in particular into account, so as to make compact sets of good predictors of whatever target you have chosen.
It is not yet implemented, so if you have many correlated features (one can imagine, for example, methylation probes: if you use them as individual features, they are highly correlated with each other), it is going to create associations that basically span across hundreds of methylation features, and they all seem significant whereas they are highly redundant. So that is still a bit of an issue, depending on what data you are focusing on, but we are certainly going to address it in future versions of RF-ACE.

Andre? Thank you, it's very interesting work. I was just wondering, maybe I missed it, but your plot, which I interpreted as a fitting curve, carried a heading saying "training data". Did you really try to split your data set into a training set and a test set, and evaluate the performance on a test set that you never used for training?

Excellent question. No, not in these examples, I didn't do that, but I should have done it, I know.

Okay, let's thank Timo one more time. Thank you.
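The held-out evaluation the last questioner asks about can be sketched in a few lines with scikit-learn, on synthetic data rather than the TCGA example from the talk:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Sketch of the evaluation suggested in the final question: hold out a
# test set that the model never sees during training, and report
# accuracy there instead of on the training data.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = X[:, 0] + X[:, 1] ** 2 + 0.2 * rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

test_r2 = model.score(X_test, y_test)  # R^2 on unseen data only
print(round(test_r2, 2))
```

The gap between training and test accuracy is exactly what a plot labeled "training data" cannot show, which is the point of the question.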