So thank you for the opportunity to present this work today. I work in the cancer systems biology group at Institut Curie. My talk is related to the previous one, because we are also developing methods to analyze molecular profiles, mainly for cancer patients. We are struggling with these molecular profiles, trying to understand how different patients respond in very different ways to the same treatment, or how they differ even when they present with the same disease. So I will discuss our approach of analyzing these molecular profiles in terms of gene sets. We are no longer measuring only five or six markers; we are now measuring thousands of genes. A transcriptomic or proteomic experiment contains measurements of 5,000 or 10,000 genes, so we really need to simplify these data to understand whether they can tell us something about the disease and the patient. This concept of gene sets is becoming more and more popular in the analysis of molecular profiles because, comparing different patients with the same disease, we increasingly understand that the same pathway can be affected in different individuals in different ways. The same signaling pathway can be affected in one patient at the membrane level, while in another patient it is affected at another molecule, further downstream. The same signaling is affected, but these samples are more comparable at the pathway level than at the single-gene level. This is why gene set approaches are used more and more to dissect this type of data. Here is an example: we have a cohort of colorectal cancer patients, and we know from previous analyses in mouse models that Notch signaling is affected, and in a different way, between invasive and non-invasive tumors.
So we want to verify what is happening in the human samples, thanks to these very large projects like The Cancer Genome Atlas, where you have access to public data on very large cohorts of patients; this is an international effort. We go to the colorectal cancer data in these very large cohorts of hundreds of samples, and according to the clinical data we divide our cohort into two groups, metastatic and non-metastatic, simply based on the appearance of metastasis. Then we check the expression of genes involved in Notch signaling in these two groups. And we observe, for many genes known to be involved in Notch signaling, that there is no big difference at the single-gene level. [Question: Are these cancer cell cultures, or cancer cells from biopsies?] They are biopsies, exactly; these are the cancer samples. Right. So at the single-gene level we were not able to detect any difference in Notch signaling. We therefore tried to develop a method that can capture information at the gene set level, to understand how active or inactive Notch signaling is in each sample. The idea of a gene set means that we need to define groups of genes that are considered to have a coordinated expression. These can be, for instance, targets of a common transcription factor, some downstream targets, or genes involved in the same signaling process; there are many different ways to define these sets. As soon as we define these sets, we can quantify the activity of the whole set. There is another big assumption in our method, which is based on a unifactor linear model. Of course this can be expanded and developed in a more complicated way, but the first approximation in our model is that the genes in the same set are under the influence of one main factor, which can be an oncogenic event or a drug perturbation.
But we need to make this assumption. If we can, then we can approximate and rewrite the matrix of gene expression for all the genes in our set as the product of two vectors. The first vector is the metagene, that is, the first principal component of the expression of the set of genes. The second vector gives us the level of this metagene in the different samples. So the big matrix of many genes is now approximated by these two vectors: the vector of coefficients that gives us the metagene, and the weights, that is, the level of this metagene in our samples. This model allows us to identify what is an overdispersed gene set. This is a concept already presented in some recent literature, where we try to define which gene sets, in our conditions, express an excess of variance compared to the background. Meaning that, if we test many different gene sets on our data, some show high variability compared to a background of random gene sets, and we consider those sets to be particularly active or inactive in our conditions. Of course we have to define this background expectation, and once we have it, we can identify the active or inactive sets. All of this is implemented computationally, which allows us to do it automatically for many pathways; and if we don't have any a priori knowledge, we can test a complete database of the pathways that are now available. So if we have the expression data, this big matrix with 10,000 genes measured in many conditions, and we have a definition of the sets or pathways we want to test, we can run the analysis and at the end obtain coefficients of overdispersion: for each gene set, how much it is overdispersed in our data, according to the level of its metagene in our samples. There are some features we had to introduce because, statistically, we need to take some facts into account.
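To make this rank-1 (unifactor) idea concrete, here is a minimal sketch in Python with toy data. The gene set, sample sizes, and noise level are all invented for illustration; the actual implementation in the tool surely differs, but the decomposition is the same: the set's expression matrix is approximated by the outer product of a metagene vector and a vector of per-sample levels.

```python
import numpy as np

# Toy expression matrix: rows = genes in one gene set, columns = samples.
rng = np.random.default_rng(0)
activity = rng.normal(size=8)   # hidden per-sample factor (e.g. pathway activity)
weights = rng.normal(size=5)    # per-gene response to that factor
X = np.outer(weights, activity) + 0.1 * rng.normal(size=(5, 8))

# Center each gene across samples, then take the first singular triplet.
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
metagene = U[:, 0]             # per-gene coefficients: the "metagene"
sample_scores = s[0] * Vt[0]   # level of the metagene in each sample

# Rank-1 approximation: the big matrix as a product of two vectors.
X1 = np.outer(metagene, sample_scores)
explained = s[0] ** 2 / np.sum(s ** 2)
print(f"variance explained by first component: {explained:.2f}")
```

When one factor really drives the set, the first component captures most of the variance, and `sample_scores` is the per-sample activity score the talk refers to.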
For instance, a very small gene set is quite likely to show this excess of variance just by chance; the variance captured by the first component really depends on the size of the set, so we should avoid taking these very small sets into consideration. We do this by assessing a null distribution: we pick random sets of genes of the same size and test the variance of these random sets. Typically, for small sets, the random sets show the same variance as the one we are testing. Another thing we take into account is that we can have different patterns of overdispersion. Usually some genes contribute positively and others negatively to the variance of the set, but there are cases in which all genes contribute in the same direction: imagine a transcription factor that, for whatever reason, acts only as an activator, so all its targets are overexpressed in the set. Our method is also able to detect this type of pattern. Also, if we already have a priori biological knowledge and we know that some genes are active or inactive, we can introduce this by fixing some weights in the decomposition, to take our prior knowledge of some pathways into account. The last point is that sometimes the first principal component is affected by outliers, a very particular gene in the set that behaves very differently from the rest. Typically we would like to avoid this; in some cases it is a very important marker in the pathway, so it can be interesting, but in many cases it is noise in the measurements, so we should avoid having the results rely too much on such genes. And so we did some applications; I will show you at least two, I hope, if I have time. The first one is Notch signaling in colorectal cancer, where, as we saw, single-gene expression is not particularly informative.
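The size-matched null distribution just described can be sketched as follows. This is a toy example with simulated data (gene counts, sample counts, and the number of random draws are arbitrary choices, not the tool's actual defaults): the observed first-component variance fraction of a candidate set is compared against random sets of the same size drawn from the whole expression matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

def l1_score(X):
    """Fraction of total variance captured by the first principal component."""
    Xc = X - X.mean(axis=1, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)

# Toy data: 1000 genes x 40 samples; genes 0..19 share one driving factor.
expr = rng.normal(size=(1000, 40))
factor = rng.normal(size=40)
expr[:20] += np.outer(rng.normal(size=20), factor)

gene_set = np.arange(20)              # the candidate set we test
observed = l1_score(expr[gene_set])

# Null distribution: random sets of the SAME size, drawn from all genes.
null = np.array([
    l1_score(expr[rng.choice(1000, size=len(gene_set), replace=False)])
    for _ in range(200)
])
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
print(f"observed L1 = {observed:.2f}, empirical p = {p_value:.3f}")
```

Because the null sets have exactly the same size as the candidate set, the test automatically corrects for the fact that small random sets can reach high first-component variance by chance.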
But now, if we measure Notch, and we also include some other pathways that were informative for us, WNT and p53, these are the results of our gene set quantification. Each point here is a sample, and the level is the score of our ROMA tool for Notch signaling in that sample. We can see that among the aggressive tumors, the red ones, there is at least one group where Notch does not seem really active, while in the non-aggressive tumors the situation is, let's say, more distributed. For p53 the result was really clear: the p53 targets are lost in these patients compared to the other group. WNT also gave a significant result, because it seemed to be more active in the invasive tumors than in the non-invasive ones. So this was an example of measuring these data in terms of gene sets. Another application we are pursuing more and more is analyzing transcriptomic data by checking its consistency with a mathematical model of metastasis that was developed in our group. This diagram is a typical Boolean model of invasiveness. What does that mean? We have nodes connected by edges, which can be activating or inhibitory, and we try to model the process in terms of logical rules that, starting from some initial conditions, evolve towards different phenotypes. [Question: And what could such a node be, biologically?] The biology is what is known about some process, some signaling that is involved in invasiveness, and these components are connected accordingly. [Question: Specifically, what information goes into a node? Clinical information?] It's not clinical information.
This is biological information, meaning that we know from the literature that, for instance, AKT1 or AKT2 are involved in the process of invasiveness. We put all the components together, we connect them according to inhibitory or activating influences, and then we give logical rules to evolve from initial conditions towards phenotypes, towards observables. This is the process of logical modeling. This was done by my colleagues, but I can try to explain how it works. What we have done now is this: OK, we have these components, but we would also like to see whether the molecular profiles measured in our data fit with them, whether what we expect to be active is really expressed in our data, and whether what is expected to be inactive is down. If we do this gene by gene, again we don't see any big difference. So we connect each node in the model to a set of genes, instead of a single one. This means that for TGF-beta, for instance, we measure with ROMA the level of the TGF-beta targets, and the same for the other components of the network. This seems to give a more consistent result: some things that are expected to be inactive in the non-metastatic group become active in the metastatic one, et cetera. And this, again, was not possible to observe at the individual gene level. I have five more minutes, so let's skip the third example. But again, all these tools are available if they can be of interest for other types of applications. This is the working group collaborating on this project; many of them are in the cancer systems biology group, and Eric Bonnet is now [inaudible]. And that's it. Thank you. Thanks a lot. OK. Yes? [Question:] I'm not sure about this, because it's relevant to the approach and the interpretation of the data.
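The logical modeling step described above, components connected by activating or inhibitory influences, with rules evolving an initial condition towards a phenotype, can be sketched with a tiny Boolean network. The nodes and rules here are purely illustrative (they are not the group's actual invasiveness model): a hypothetical cascade in which TGF-beta activates AKT2, and an invasion phenotype that requires AKT2 but is blocked by p53.

```python
# Toy synchronous Boolean model; node names and rules are illustrative only.
rules = {
    "TGFb":     lambda s: s["TGFb"],                  # input node, held fixed
    "AKT2":     lambda s: s["TGFb"],                  # activated by TGF-beta
    "Invasion": lambda s: s["AKT2"] and not s["p53"], # phenotype read-out
    "p53":      lambda s: s["p53"],                   # input node, held fixed
}

def run(state, steps=10):
    """Synchronously apply all rules until a fixed point (or `steps` updates)."""
    for _ in range(steps):
        new = {node: bool(rule(state)) for node, rule in rules.items()}
        if new == state:
            break
        state = new
    return state

start = {"TGFb": True, "AKT2": False, "Invasion": False, "p53": False}
attractor = run(start)
print(attractor)  # Invasion switches on once AKT2 has been activated
```

Different initial conditions (for instance, p53 on) lead the same rules to a different attractor, which is exactly what makes it possible to compare the model's expected node states against the gene set activities measured in each patient group.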
In practice, now you find, for instance, that the WNT gene set is changed in a significant way. What does that mean? It means WNT is implicated in colon cancer, but it might be that the WNT gene set is part of a larger gene set and is essentially driven by the other components of that larger set. So it might be only a passenger of a larger change, and that is what really counts in cancer. What it means is that you can't draw conclusions about the relevance of the changes in this gene set simply from these data, without the context. [Answer:] Yeah, the assumption, as I said, is this unifactor model. As soon as you apply these methods to a cancer data set, you assume that the main driver is an oncogenic event; this is what really gives the data their variance. Of course, if you think that the main factor driving the variance of your data is something else, then your conclusions about the cancer process are, of course, debatable. [Question:] What is the predictive level of your data? How much can you actually predict from molecular data about the development of the cancer, or the efficiency of a treatment? Everything else, one could say, is secondary. [Answer:] A big step is that we are now able to discriminate between different risk groups in our patients. The big aim is to be able to predict which tumors will be more aggressive than others. In some projects we are now able to at least identify clearly different groups that were not identified before, so this is at least a first step, because we can discriminate clusters of patients that were not distinguished before. [Question:] They are identified only by the molecular data, not by the clinical behavior? So it is just a correlation, indeed, what you do?
Not completely, because then, when you analyze, for instance, when you go inside some markers, you can see that these markers, this signaling, really is different between groups that were not identified before. So you are putting in evidence components of the process that were not identified before. OK. [Question:] Could you expand on why you focus on expression levels, if I understood correctly, as opposed to regulation on the pathway, phosphorylation, or other modifications? [Answer:] Right. The main point is that the available data were mainly at the transcriptional level, but the same methods are now applied to proteomic and phosphoproteomic data. Transcriptomics was simply the first level at which we were able to measure thousands of genes, to have these genome-wide experiments; but now, for cancer patients, it has become a reality to have the proteomic level as well, and we are applying the approach more and more to other molecular levels. [Question:] My question is, there are only marginal differences at the expression level, right? But if you go for regulation, have you actually tested it already? [Answer:] Which type of regulation? For instance, in phosphoproteomic data, what we can do is analyze sets of substrates of a given kinase, so we are able to identify kinases that are active or inactive in some samples. So we were still able to analyze these data and interpret them in terms of active or inactive kinases or phosphatases. [Question:] So those are stronger predictions? [Answer:] Again, the fact is that we are able to clearly identify different groups that behave differently in the phosphoproteomic analysis. As for prediction, we are not yet able to say how good a predictor this is, because we would need a cohort of patients with a certain follow-up, et cetera; we are not yet able to build predictive models, but we can at least identify groups of different patients. Yes, sorry. Can I say something? You must say something.
I must. So I just want to say, in defense of ROMA and Lavender: these are programs that they developed. I understand; I'm not a bioinformatics person. They developed them for the use of our community; they are trying to talk to us. And it won't always have to go through ROMA; there will be other ways that you can use ROMA. That's what you were going to say, yes? Right, yeah. And let me also say that the goal is not only to understand biological processes, but also to do some diagnostics or prediction. That's why she is trying to get the best separation between the groups, so that it will be predictive of metastasis or something like that. OK, thank you very much. Thank you very much.