Hi, I'll be presenting today some work that my colleague James and I have been putting together on chaining classifiers.

The motivation for this work is a recurring theme we've seen in cancer biology: cell-specific programs, specifically those of embryonic stem cells and certain parts of the mesenchyme, are recurrently and errantly activated in cancer cells, where they really have no business running. If you look at cancer as a disease of pathways rather than of single mutations, this makes sense: the cancer wants a robust program that it can co-opt and that induces proliferation, so it's no surprise that it has chosen the embryonic stem cell program. But we're also actively looking for more of these programs, and we want to develop a method that can detect them.

So our overall approach is this. We learn specific programs. On the left you see that we've thrown in a bunch of stem cell expression data sets, and we feed these to classifiers that we have inherited from the machine learning community. We do this through a system we call mechamon, which tests a large number of feature selection methods and algorithms at once, so that we can choose the optimal one for detecting a particular expression profile. Then, independently, we take the cancer expression data sets, such as the ones we've got from the TCGA, and we normalize them in the same way. Then we apply the classifier that we trained on the embryonic stem cell data, or various other data, to the cancer genomics data, to see if we can detect a signal in cancer that is being erroneously activated.

The first step is to make sure that we have a robust signal. Each dot here represents a microarray that we have run through our classifiers, and these are two independent classifiers: one that has learned to distinguish
embryonic stem cells from their early progenitors and adult stem cells, and then on the y-axis you can see that we've also been able to learn the difference between embryonic stem cells and induced pluripotent stem cells. That will be an interesting signal to keep an eye on going forward, because induced pluripotent stem cells have been shown to be carcinogenic, so that signal may also play out in cancer. But keep in mind that the x-axis is the one we're concerned with here: it's the one we've learned that is embryonic stem cell specific.

What we did is apply that learner to each of the expression data sets that have come out of the TCGA so far, or five of them, I should say: breast cancer, colorectal cancer, glioblastoma, lung cancer and ovarian cancer. What we could see is that there is a shift in the mean expression of the embryonic stem cell signature in all of these, but the p-values are somewhat middling except for colorectal and lung cancer. So we can't yet say for certain that these programs are activated in the other types of cancer.

So we split it up a little further. In breast cancer, for instance, we looked at the relative stemness of the different subtypes. Not surprisingly, we found that while the luminal subtypes are not significantly altered from normal breast tissue, the basal subtypes are, and they have a significant increase in this stem-like quality.

However, we can also extend this method to signals beyond just stem cells. In this case we're learning on one type of cancer and applying it to another.
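The pattern used throughout the talk (learn a signature on one data set, normalize a second data set the same way, apply the learned scorer, and test for a shift in the mean score) can be sketched roughly as follows. This is only an illustration under assumptions, not the system described in the talk: the data are synthetic, the scorer is a simple difference-of-centroids projection, per-sample z-scoring stands in for whatever normalization was actually used, and Welch's t statistic stands in for whatever test produced the p-values. All function names are hypothetical.

```python
import math
import random
import statistics as st


def standardize(sample):
    """Z-score one expression profile; the same step is applied to every data set."""
    mu, sd = st.mean(sample), st.pstdev(sample)
    return [(x - mu) / sd for x in sample]


def train_scorer(positives, negatives):
    """Return score(sample): projection onto the difference of the class centroids."""
    n_genes = len(positives[0])
    pos_mean = [st.mean(s[g] for s in positives) for g in range(n_genes)]
    neg_mean = [st.mean(s[g] for s in negatives) for g in range(n_genes)]
    weights = [p - n for p, n in zip(pos_mean, neg_mean)]
    midpoint = [(p + n) / 2 for p, n in zip(pos_mean, neg_mean)]
    return lambda s: sum(w * (x - m) for w, x, m in zip(weights, s, midpoint))


def welch_t(a, b):
    """Welch's t statistic for a difference in means (unequal variances)."""
    standard_error = math.sqrt(st.variance(a) / len(a) + st.variance(b) / len(b))
    return (st.mean(a) - st.mean(b)) / standard_error


rng = random.Random(0)


def simulate(n, shift):
    """n synthetic 20-gene profiles; the first five genes are raised by `shift`."""
    return [[rng.gauss(shift if g < 5 else 0.0, 1.0) for g in range(20)]
            for _ in range(n)]


# Training domain: made-up "stem" versus "non-stem" profiles.
stem, non_stem = simulate(30, 3.0), simulate(30, 0.0)
score = train_scorer([standardize(s) for s in stem],
                     [standardize(s) for s in non_stem])

# Target domain: made-up tumor and normal-tissue profiles, normalized the same way.
tumor, normal = simulate(30, 1.5), simulate(30, 0.0)
tumor_scores = [score(standardize(s)) for s in tumor]
normal_scores = [score(standardize(s)) for s in normal]
print("Welch t =", round(welch_t(tumor_scores, normal_scores), 2))
```

A positive t statistic here means the tumor profiles score higher on the learned signature than the normal profiles do, which is the kind of mean shift described above.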
So in this case we looked at breast cancer and trained a classifier to recognize luminal versus basal, which is a very easy task. To prove this, we trained on 80% of the data and held out the final 20%, and what you can see is that the held-out 20% separates very well. Green is basal, and red and blue are luminal A and B, and the basal samples are completely separated from the luminal ones.

But what's more interesting is what happens when we then apply this classifier to ovarian cancer. We can see first of all that ovarian cancer has a slightly more basal quality, but also that the classifier separates the subtypes of ovarian cancer. The bottom graph is zoomed in on just the ovarian cancer, and we can see that the immunoreactive subtype has a much more basal-like quality than the mesenchymal.

So I just want to leave you with one idea of how we're going to continue this work. We want to learn binary classifiers between each pair of the cell types available to us. Here we're pulling in data from the gene expression atlas, and we've got 52 normal cell types to work with.
We're learning a binary classifier for each cell type versus each of the others, one versus the next, and this gives us an upper triangular matrix of classifiers, one for each pair. Just to remind you, at each point in this matrix we're optimizing over a large number of algorithms and feature selection methods, so that we don't bias ourselves toward, or attach ourselves to, any one method. We may find that a different algorithm is more appropriate for each of these tasks, so we want to be able to take that into account.

Then a new sample comes in, say a cancer sample. We run it on each of these classifiers individually and find out its log likelihood of being in either one of the two categories. From this we define a signature for each of the cells, based on their performance in these classifiers, and that signature can give us information about where it falls in the hierarchy of development, and possibly also what the tissue of origin is, what errant cell programs are being activated, and how it correlates with other interesting clinical variables.

With that I'd like to thank my advisors, Josh Stewart and David Husser, and also the other machine learners in my group, Crescito, Artem and Sam. And I'd also like to thank TCGA and the California Institute for Regenerative Medicine for funding. Thanks. Questions?

So it'll be really interesting to look at how the Polycomb methylation changes that we see interact with these expression signatures.

Yeah, definitely. We've already seen, pulling out some of the main differentiation pathways, that the ones that are important in keeping pluripotency are very active in cancer, for sure.

Question: this is Gordon Saxena from Broad. How are you addressing the overfitting issue?

Which overfitting issue?

I guess, I mean, are you keeping out some data?
Yes. Our method for evaluating each of the binary classifiers is cross-validation. We usually use three times five-fold cross-validation, and that's just a matter of selecting the best method. The actual final classifier that we keep is the one with the best performance in cross-validation, but it's trained on all the data. So it will be overfit to that particular data set, but then when we apply it to other data sets it's capturing the most biological information that we possibly can.

Right, okay. And then you go back and look at it as more data comes in?

Yes.

If there are no further questions, thank you, Daniel, and we'll move on to the next presentation.
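The evaluation scheme from the answer above, combined with the pairwise classifiers from the talk, can be sketched as follows. This is a minimal illustration under assumptions, not the group's actual system: "three times five" is read here as three repeats of five-fold cross-validation, the only "methods" being compared are hypothetical nearest-centroid classifiers that differ in how many genes they keep, the log-odds assume unit-variance Gaussian classes, and the three cell types and their data are synthetic. Every function name is made up for the example.

```python
import itertools
import random
import statistics as st


def fit_centroid(X, y, a, b, n_features):
    """Fit a two-class nearest-centroid scorer on the top n_features genes.

    Returns log_odds(x). Under an assumed unit-variance Gaussian per class this
    is log p(x | b) - log p(x | a), so positive values favor class b.
    """
    n_genes = len(X[0])
    mean_a = [st.mean(x[g] for x, lab in zip(X, y) if lab == a) for g in range(n_genes)]
    mean_b = [st.mean(x[g] for x, lab in zip(X, y) if lab == b) for g in range(n_genes)]
    # Feature selection: keep the genes whose class means differ the most.
    genes = sorted(range(n_genes), key=lambda g: -abs(mean_a[g] - mean_b[g]))[:n_features]

    def log_odds(x):
        return 0.5 * sum((x[g] - mean_a[g]) ** 2 - (x[g] - mean_b[g]) ** 2 for g in genes)

    return log_odds


def cv_accuracy(X, y, a, b, n_features, repeats=3, folds=5, seed=0):
    """Accuracy over `repeats` reshuffled rounds of `folds`-fold cross-validation."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    correct = 0
    for _ in range(repeats):
        rng.shuffle(idx)
        for f in range(folds):
            held_out = set(idx[f::folds])
            train = [i for i in idx if i not in held_out]
            log_odds = fit_centroid([X[i] for i in train], [y[i] for i in train],
                                    a, b, n_features)
            correct += sum((b if log_odds(X[i]) > 0 else a) == y[i] for i in held_out)
    return correct / (repeats * len(X))


def select_and_refit(X, y, a, b, candidates=(1, 5, 20)):
    """Pick the candidate with the best CV accuracy, then refit it on all the data."""
    best = max(candidates, key=lambda k: cv_accuracy(X, y, a, b, k))
    return best, fit_centroid(X, y, a, b, best)


def pairwise_signature(data_by_type, sample):
    """Score `sample` with one selected-and-refit classifier per pair of cell types."""
    signature = {}
    for a, b in itertools.combinations(sorted(data_by_type), 2):
        X = data_by_type[a] + data_by_type[b]
        y = [a] * len(data_by_type[a]) + [b] * len(data_by_type[b])
        _, log_odds = select_and_refit(X, y, a, b)
        signature[(a, b)] = log_odds(sample)
    return signature


# Synthetic stand-in for an expression atlas: three made-up cell types, 20 genes,
# each type over-expressing its own block of five genes.
rng = random.Random(0)
N_GENES = 20


def profile(block, noise=1.0):
    return [rng.gauss(3.0 if 5 * block <= g < 5 * block + 5 else 0.0, noise)
            for g in range(N_GENES)]


atlas = {name: [profile(i) for _ in range(20)] for i, name in enumerate(["A", "B", "C"])}
new_sample = profile(1, noise=0.0)  # a noise-free, "B-like" hypothetical sample
signature = pairwise_signature(atlas, new_sample)
print(signature)
```

Each entry of `signature` is keyed by a pair of cell types, and its sign says which member of the pair the new sample resembles more. Because the kept classifier is refit on all the data, the cross-validation accuracy only guides the choice of method; it is not an unbiased estimate of the final classifier's performance.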