Thanks for the introduction, and thanks for coming. And sorry, I don't think I even have the logos of all the institutions you mentioned; you even mentioned CERN, but I forgot that logo. The reason there are so many logos here is simply that these institutions sit in the same place, on the same street. Roughly, ENS is more on the math and computer science side, Institut Curie is a hospital and research center on the biology side, and Mines ParisTech does computational biology, so I try to work at the interface of these domains. All right, so my talk may be a little different from many of the talks this week; I am certainly not a biologist. What I would like to discuss are some difficulties we face when we deal with genomic data, and how we try to overcome some of them. More particularly, I will talk about how we can do inference, how we can learn from genomic data, because nowadays, when we want to study biology, we generate lots of data and we say: from the data, we are going to learn something. I will discuss this problem of learning from data and why it is not so obvious. To be concrete, since this is a session about health, I will motivate everything with health applications, in particular personalized or precision medicine, for example in cancer. There is a view that we can now improve the way we manage the disease and treat patients by capturing lots of data about each individual. For someone with a cancer, we can sequence the person, read the genome of the tumor, read the genome of the patient, measure proteins, do imaging; we can collect lots of data about each individual. And by looking at these data, we observe a big diversity among cancers: no two persons have the same cancer at the molecular level.
So maybe, by analyzing all these very precise data, we can predict or suggest specific treatments, specific ways to approach the disease, and for example give different drugs or different treatments to different persons based on their molecular profiles. There has been some success in that direction without going to full genome sequencing: it is well known that some markers, like the expression of certain proteins, are sometimes predictive of the response to a treatment, and this is already used in the hospital. What we would like to do is to go a bit further: not look at one or two proteins, but look at the full genome, the full images, the full proteome, and end up with algorithms that take these data as input and suggest a treatment to give. The difficulty is that we do not know what the algorithm should be. If I give you all these data and ask which drug to give, it is not obvious; we have some knowledge, but it is very partial. So we would like to automate the process and design algorithms that do something automatically. The way science proceeds these days is to say: maybe we can collect lots of data about many people, observe how the disease behaves or how it responds to different therapies, and try to find associations, correlations, between what we measure and the fact that some drug works or not. And if we capture such an association, then based on what we have observed in large cohorts of patients, in clinical trials for example, we can suggest that people who express a particular protein or carry particular mutations tend to respond well to the blue drug, and therefore that those people should be given the blue drug. So this approach makes sense.
This is typically what we call a machine learning, or statistical, approach: we are not sure we understand why the blue drug works, but from empirical observations we find an association between what we measure as input and what the output is. So I am going to talk about this process of designing an algorithm that decides what to suggest as output from the input: which drug to give from, say, the genomic information. More precisely, how such an algorithm can be designed by analyzing lots of data where we have collected information about patients and the corresponding responses. Statistically speaking, this is a very standard and classical problem called regression or classification. In a statistics or machine learning textbook, you often see pictures like this one, an abstraction of the problem, which goes as follows. We have collected data about a number of patients with cancer; we have given them some blue drug, and we have observed that sometimes the drug works and sometimes it does not. What we want to learn, or infer, is a rule that predicts the effectiveness of the drug from the input. Mathematically, or visually, imagine points where each point is a patient. I plotted the points in two dimensions, as if I had measured two values, say the expression of two genes: the x coordinate is one gene, the y coordinate is a second gene. So you can plot the patients, and they have colors indicating, say, whether they respond to the drug or not. The statistical question, the inference process, is: from this picture, can you learn a rule that could then be applied to predict the color of a patient?
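This two-gene toy setting is a few lines in any machine learning library. A minimal sketch on synthetic data (the patients, gene values and labels below are made up purely for illustration):

```python
# Toy version of the 2D picture: 19 synthetic "patients", two gene
# expression values each, labelled responder (1) / non-responder (0).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 19
y = rng.integers(0, 2, size=n)                    # responder labels
X = rng.normal(size=(n, 2)) + 3.0 * y[:, None]    # responders shifted upper-right

clf = LogisticRegression().fit(X, y)              # "find the line"
train_acc = clf.score(X, y)                       # fit on the observed patients
new_patient = clf.predict([[4.0, 4.0]])[0]        # far upper-right point
print(train_acc, new_patient)
```

With the responders shifted to the upper right, the fitted line separates the two colors well, and a new point in the upper-right corner is predicted as a responder.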
And this picture is quite obvious; your brain does this all the time, which is why we call it machine learning, referring to the brain's capacity to learn. We observe a trend: black dots on the upper right, white dots on the lower left, and therefore maybe a rule like this line separates the responders from the non-responders. If you can find this line, then in the future, when we see a new patient, we can predict the color of the patient from its position and decide, say, that we should give the drug to this person and not to that one. So this is a well-defined and well-understood problem, and the picture here is a solved problem in mathematics and statistics; it was solved a hundred years ago. Logistic regression, or more recent fancy approaches like decision trees, solve this problem exactly. Now, when I say it is solved, it is solved when you have, let's say, 19 points in two dimensions: you see the 19 points here, and in 2D it is easy. If you apply this approach to genomic data, there is a problem, which is that genomic data are not really 19 patients in two dimensions, because for each patient you do not measure two genes; you measure a lot of things. That is why we talk of genomics: not just two proteins, but all the proteins, all the genes, all the mutations in the DNA, images, et cetera. So a patient is no longer a point in 2D; it may be a point in one million dimensions if we measure the mutations, for example. You consider the same problem as before, a cohort of patients, black and white in the sense of responders and non-responders, but now each point is a point in one million dimensions.
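To see how badly the picture changes in high dimension, here is a sketch on pure synthetic noise: a linear rule fits 19 completely random labels perfectly once there are enough dimensions, yet predicts future patients at chance level. The minimum-norm least-squares rule is used only because it is guaranteed to interpolate; any separating rule illustrates the same point.

```python
# 19 patients, 20,000 noise "genes", random labels: a linear rule can
# fit them perfectly, yet predicts new patients at chance level.
import numpy as np

rng = np.random.default_rng(0)
n, p = 19, 20_000
X = rng.normal(size=(n, p))                 # pure noise, no biology at all
y = rng.choice([-1.0, 1.0], size=n)         # labels carry no signal

# Minimum-norm interpolating linear model: beta = X^T (X X^T)^{-1} y
beta = X.T @ np.linalg.solve(X @ X.T, y)
train_acc = (np.sign(X @ beta) == y).mean()          # exactly 1.0

X_new = rng.normal(size=(500, p))                    # "future" patients
y_new = rng.choice([-1.0, 1.0], size=500)
test_acc = (np.sign(X_new @ beta) == y_new).mean()   # hovers around 0.5
print(train_acc, test_acc)
```

Since n is much smaller than p, the rows of X are linearly independent with probability one, so the rule reproduces the training labels exactly; on new noise it can only guess.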
And when you think about finding the line that separates the two types of points: in 2D it is easy, but in higher dimension there is an infinite number of ways to separate 19 points in one million dimensions. You can even show mathematically that the problem becomes ill-posed. There are too many solutions that separate the black from the white, and this leads to a phenomenon we call overfitting: it is very easy to separate the points we see, but very hard to ensure that the rule we found will be good for future patients. There are so many ways to separate black from white that you pick one of them, but there is no reason it should be a good one, no reason it should be the one that predicts well. So, a long introduction for something well known: doing statistics in what we call high dimension is hard when we do not have enough points, and the standard situation in genomics is exactly this one. We typically do not have 19 patients; sometimes we have 100, 1,000, even 10,000. That sounds like a lot at the hospital, when you say you have sequenced 10,000 people: it is a massive investment, it takes time, and these are people. But mathematically, a thousand points in a million dimensions is not good news; it is exactly the hard situation. Is this just a conceptual issue? I think not, because there are signs that something does not really work in many of the genomic studies we do or publish. Just to give you... Someone asks which cancer this was, the 19 patients. It was not a cancer; that was a theoretical model just to illustrate the problem. Now let us look at real data, so I will talk more precisely about some cancer data.
This could be breast cancer; we have public data with 2,000 samples and the response to some treatment. Someone asks: can I have a question about this? I thought that the more patients you have, the better, if you just stay within two dimensions, because you estimate the curve better. So couldn't the machine run in 2D on the cohort of patients with the same disease, taking two genes at a time, and segregate them that way: everybody who responds is either here or here? For two genes you know where to put the line, so why do you need a million dimensions? Well, if you do that, either you decide by yourself to focus on two particular genes and then you have one rule, or you go through every pair of genes. You can do that, but then you hit the same problem: if you pick two genes among a million, there are a million times a million divided by two ways to do it, so on the order of 10^12 such pictures, and many of them will show a perfect separation just because you tried so many; it is a question of multiple testing. Someone else points out that real life has a gray area: with a five-year endpoint, there will be patients who relapse at four and a half years, close to five. Sure, but what I am saying is that for purely mathematical reasons, if you try 10^12 2D plots, you will get maybe 10^9 with perfect separation just because you tried so many. And then the question is: what do you do with that? If you say, I try all of them and pick the best one, then this is overfitting. It will look like you have found two magic genes that separate the patients, but when you try a new patient it will not work, and you will ask yourself why.
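This trap is easy to reproduce on pure noise. In the sketch below, even single genes, screened one at a time to keep things fast, "perfectly separate" responders from non-responders by chance alone once you screen enough of them (all numbers are synthetic and illustrative):

```python
# Screen 2,000 pure-noise genes for one that perfectly separates
# 5 responders from 5 non-responders: several will, by chance alone.
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_genes = 10, 2000
X = rng.normal(size=(n_patients, n_genes))   # noise, no signal anywhere
y = np.array([0] * 5 + [1] * 5)              # arbitrary labels

n_perfect = 0
for g in range(n_genes):
    a, b = X[y == 0, g], X[y == 1, g]
    if a.max() < b.min() or b.max() < a.min():   # some threshold splits exactly
        n_perfect += 1
print(n_perfect)   # typically around fifteen "magic" genes out of pure noise
```

The per-gene chance of a perfect split here is 2/C(10,5) ≈ 0.8%, so with 2,000 noise genes you expect roughly fifteen spurious perfect separators, which is exactly the multiple-testing effect at work.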
The reason is the absence of correction for multiple testing: there were simply too many things to try. So we have a problem and, someone says, you are going to solve it. Exactly. But before that, let me show that it is not only a theoretical problem. Take breast cancer, one of the applications people have looked at the most: can we use, not all of genomics, but for example gene expression data? We can now measure the expression of 20,000 genes, so we can map each patient into 20,000 dimensions and ask: can we separate? Here I will consider not a drug response but the risk of relapse, which is important when someone has breast cancer: if you can evaluate the risk of relapse precisely, it can inform whether chemotherapy should be given or not, so it can really impact the treatment. People have been very excited about using genomic information for this, to replace what has been done in the clinic for many years, namely estimating the risk of relapse from the age of the person or the size of the tumor, which works to some extent but is not very precise. There are now even products used in hospitals; people use a test called MammaPrint, which was designed from this kind of analysis. It is a somewhat controversial issue: some people say it brings a lot, others say that at the end of the day the genomic information does not bring much. And there have been competitions to assess, as objectively as possible, whether using genomic data works. People said: let us collect lots of data, where lots is typically a thousand or 2,000 samples, and ask people to predict the risk of relapse, typically through competitions on the web.
Here is one example of a big competition where teams tried to predict the risk of relapse either from genomic data or from the age of the person, the size of the tumor, and a few markers. This is a summary of the results; I will not go through all the details. Vertically you have the performance, how well it works: the score is a concordance index, so the higher the better, where higher means good prediction on new samples, not the ones you trained on. The columns are the performance of the different models tried by different teams. The very bad news, for me and for the community, is that the teams who used just the good old clinical factors, the age of the person and the size of the tumor, reached some level of performance, say 62%. That is the state of the art of 1999, before we sequenced the human genome. Now, a few teams said: let us replace that by genomic data; we have 20,000 genes, that is wonderful. And here they are: there were not too many such teams, two in fact, and on average they reached 60%. So they lost performance compared to the good old clinical factors. That is disturbing, I would say. Many teams then said: maybe we should combine them, take the clinical factors and add the molecular data on top. And the performance goes back up, to a level which in this case is the same as the clinical factors alone. So this is almost a multi-billion-dollar industry, designing genomic tests to predict relapse, but when you look at this, you wonder. This is why I say there is some controversy: some doctors say the genomic test does not add much to what we already know; others say it does.
In fact, the devil is in the details, because even using clinical factors only, some hospitals may be using this model, which does not work well, while others may be using that one; there are different ways to train the models, not a single way to do it, not just a Cox regression. But overall, the information content of the genomic data does not seem very strong in this case. Something we observed, since we and others have studied these data, is that the performance increases with the number of samples. So what you said is correct: with more samples, it works better. And what we observe is that at 2,000 samples we are still on the increasing part of the curve. So it does not mean the performance will stay this low forever, but it is clear that currently, even though 2,000 samples seems a lot, and maybe we can reach 10,000 or 50,000 now, there is a limit on performance which is not due to the data. The data are here, they are fine; the limit comes from the inference process. Maybe it could be much better if we could train on a billion patients, but we will never have that. For the moment the limitation is not for biological or medical reasons but for statistical ones: the inference process is limited. Someone asks: for the clinical data, you said hospitals may be using different tests or analytical approaches; wouldn't they have used statistical approaches to eliminate the bad ones? Sure. But this plot is just the real outcome of a competition on the web; it is what happens in real life. You give data to 50 teams and ask them to build a predictive model. One team says: easy, I fit a Cox regression on the data.
Another says: I will do the same, but first I will transform the data. Someone else fits a survival random forest, et cetera. And everybody believes they have a good model: on their own data they do cross-validation, they test it. But when you test on new data, which is the performance shown here, you observe strong variations. These ones here are very bad; you can suppose they were students or people who did not know much and made a mistake. But overall, this level here is the baseline for this prediction task. Someone suggests: ideally you would want to combine the best clinical approach with the best molecular one. Yes, sure. But I should say there is also, I think, a problem with this kind of publication. In this particular paper, the conclusion, surprisingly, was not that clinical factors are good enough. The conclusion was: look, many teams tried to combine clinical and molecular data and prior knowledge, and the winner of the challenge is this one; therefore there is information in the molecular data that is not present in the clinical data. But if you look at the distributions from here, they do not seem different; there were simply more teams in this column. Concluding that an approach is better because the winner used it does not really make sense: with more teams in the other column, you would reach the same level there too. Someone asks: is there any significant difference between molecular alone and molecular plus prior knowledge? Molecular alone is not a good comparison, because there are only two points in these data, so no, there is no significant difference. Here we are doing statistics among teams; it is a way of doing science now, open science.
We say: let us give data to many people and do statistics, assuming that people are all equally good. Someone asks: what about prior knowledge alone, and what is prior knowledge? Prior knowledge here means using the molecular data, but asking someone who knows the genes which ones they think are the good ones. For example, instead of using all the genes, I pick only the genes I know are involved in the cell cycle, or I use a network of that kind. The question continues: and if you use just prior knowledge, without the molecular data? Here, prior knowledge means knowledge used to analyze the molecular data, and it is not better than the machine blindly using the molecular data, even though there are only two points in that column. You mean without the clinical factors, without the size of the tumor? Then it is very hard to get a good model. Compare these two columns: in one, I blindly train a model in 25,000 dimensions; in the other, I say that in addition I should focus on a few genes which doctors believe are important. And you do not do better, and there is noise on top of that. So you have all these disturbing facts which sometimes contradict our intuitions about why it should work, but which are there. Now, very quickly, a second disturbing fact, which I will explain next. Someone asks: what exactly is being predicted, what specific feature of cancer? In this case the goal was to predict the risk of relapse; this is breast cancer, so metastasis, within a certain time. Well, yes and no: this is what is called a survival model.
So, for each person, we know when the cancer was diagnosed, and then the person is followed. Someone interjects: was the stage at diagnosis known? That is crucial: whether it was diagnosed at an early stage or not makes a huge difference; that one parameter probably beats everything. Yes, sure. So the stage is itself a predictor, just like the molecular data. Exactly. You are completely right that the reason the clinical data are good is that they have been used forever, because doctors have observed what matters. When I say the size of the tumor, it is no joke that the size of the tumor is the most predictive factor in this case: a small tumor versus a big one gives a different risk, so you should use it. What I am saying is that you cannot replace it by the gene expression. There is signal in the gene expression, but the summary is that there does not seem to be more signal in this plot, in all the genes, than in just what we call the stage of the cancer, or the size. Someone asks whether the composition of the tissue is involved: whether particular classes of cells are counted, how many cells are affected. No: when I talk about the size here, it is really in centimeters, how big the tumor was. Those tissue-composition parameters are not used here, I think, except insofar as they may be hidden in the molecular data. Among the clinical data you have, for example, the expression of three proteins.
Someone points out that a doctor observing the tissue under the microscope is part of the clinical data. Yes, except that it is super summarized: they look at the image, they do many things, and they summarize all of it in five numbers: does it express the estrogen receptor, does it express the progesterone receptor, et cetera. The histology, looking under the microscope at the shape of the nuclei: yes, sure, that was done, but it is summarized in what we call the grade of the tumor, and the grade itself is a mixture of the number of cells in mitosis and so on; here it is just a number between one and four. Okay, so, second observation, which is also completely not new but still disturbing. In the early days of genomics, researchers said: let us look at genomics, let us do what I showed previously. We have molecular data, so we can fit a model to predict the risk of relapse. And maybe we do not need all the genes; maybe a few genes should be sufficient. If you are a doctor or a biologist, of course nobody believes that all 20,000 genes are needed to predict the risk of relapse. So people said: we can use nice methods to select genes and build what are called molecular signatures, meaning a subset of the genes whose expression is enough to give a good prediction of the risk of relapse. And in the early 2000s, several teams did basically the same thing: let us look at breast cancer, find the good genes that allow us to predict the risk of relapse, publish it, and make a product. I mentioned MammaPrint: that is this one, a product in the clinic now.
What is disturbing, and it is not new, everybody in the field knows it, is that you have two different teams focusing on the same problem, listing their magic 70 genes or 76 genes, and when you compare the two lists, there are three genes in common. Is that a lot or not? Well, it is essentially what you would expect from a random pick of 70 genes out of 20,000. An audience member notes that one signature is specific to lymph-node metastasis while the other is more general. That is a good point: there are many reasons why the lists could differ; it is not exactly the same kind of metastasis, not the same technology, not the same cohort. I will not talk more about that, because I want to talk about something else, but we and others were still surprised, and so some people said: let us look at just this cohort. There are 300 patients here; let us test whether there is enough statistical power in the data to discover stable signatures. Instead of comparing this paper with that one, take a single paper and randomly split the cohort into two subcohorts: 300 patients, randomly cut into 150 versus 150, and train two signatures. Same cohort, same technology, same endpoint; compare the genes. What do you get? Three in common. And whatever method you use to select the genes, a t-test or complicated methods, you never get much more than that; this has also been investigated and published by other teams. Someone asks whether the comparison is by expression or by sequencing of the genes: here it is expression; this is a very old paper, microarrays, from 2005.
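That split-and-compare experiment is easy to mimic on synthetic data. Everything below (the 500 weakly informative genes, the effect size 0.2) is made up for illustration, not taken from the real cohorts, but it reproduces the qualitative finding: two "top 70" lists trained on halves of the same cohort share only a handful of genes.

```python
# Instability sketch: split one cohort in two, select the "top 70 genes"
# in each half by t-statistic, and count how many genes the lists share.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 300, 20_000, 70
y = rng.integers(0, 2, size=n)               # relapse labels
effect = np.zeros(p)
effect[rng.choice(p, 500, replace=False)] = 0.2   # weak, diffuse signal
X = rng.normal(size=(n, p)) + np.outer(y, effect)

def top_genes(idx):
    """Top-k genes of subcohort idx, ranked by absolute t-statistic."""
    a, b = X[idx][y[idx] == 0], X[idx][y[idx] == 1]
    t = (b.mean(0) - a.mean(0)) / np.sqrt(b.var(0) / len(b) + a.var(0) / len(a))
    return set(np.argsort(-np.abs(t))[:k])

half = rng.permutation(n)                    # random 150 / 150 split
sig1, sig2 = top_genes(half[:150]), top_genes(half[150:])
print(len(sig1 & sig2))                      # only a few genes in common
```

With 150 patients per half, no single gene's t-statistic stands out reliably above the noise of 20,000 candidates, so the two selected lists barely overlap even though they come from the same cohort.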
So, gene expression: imagine these matrices with 300 patients and 20,000 genes; randomly split them in two, estimate two signatures on the same cohort, and you get two different signatures. In fact, some papers showed that if you randomly pick 70 genes and build a model on them, you get the same predictive performance in the end. What this means, and it is not really surprising nowadays, is that you should not focus too much on the 70 genes here or the 76 genes there, because you can pick almost any other subset of 70 genes and it will work more or less the same. These are not magic genes. It has also been observed and quantified that if you have not 300 patients but 500, 1,000 or 2,000, then the number of genes in common increases. This is purely statistical: because of the correlation structure among the genes, it is simply a hard problem to identify with certainty the genes that carry predictive signal, and the problem becomes easier with more samples. Given how many samples we typically have to fit a model, how many dimensions there are, and the correlation structure, it is usually extremely hard, and empirically you get this kind of result. All right, that was a long introduction to justify that there are real challenges here. In short, it is not enough to say: I will sequence or measure gene expression on my cohort and then find the good genes or build a good model; it is a bit more complicated than that. So I will now, quickly, discuss some more mathematical or statistical ways to try to make progress on this problem, which is not a biological one: it is the problem of extracting knowledge from data that sometimes have a strange geometry or topology. I will discuss two things.
Both are very standard, but I will illustrate them on some things we did; it is a broader topic. The first is that what we call regularization in learning is important and can be adapted to the problem. The second is that maybe, instead of taking the raw data, where a patient is typically represented by the vector of numbers of their transcriptome, a different representation, which I will explain, changes the geometry and makes the inference process easier. The key point is that both require assumptions, some prior knowledge; if you do not make them, it will not work. You cannot just trust the data: we are not in a setting where there are enough data to learn everything automatically. It really makes a difference to do something specific to the data. And of course these are not definitive answers; they are just things we did, and many other things can be done. Okay, so: regularization and representation. Regularization, what is it? Maybe many of you are not familiar with statistics and machine learning, so remember the first picture with the points and the line, and let us formalize the process a bit. We have observations which are vectors. Here is a zoom on some real data: we have samples, which are patients. Imagine this is a matrix of numbers representing gene expression values: each row is a patient, each column is one gene, so 20,000 columns and a few hundred rows. And you have a response, which I call the Y variable, which here we take as binary: imagine it is 1 if the patient relapses and 0 if not. This particular dataset, not that it matters, is breast cancer.
Remember, observing these data is another way of saying we have black and white points in 20,000 dimensions; let us fit a line to separate them. A line, mathematically, is just a linear function. So the goal here is to infer a linear function on this space: if x represents a patient, I write the linear function as β⊤x, the sum of the βᵢxᵢ, and the goal is to infer β. β is, so to speak, the slope of the line, the direction of the hyperplane, chosen to separate the +1 samples from the 0 samples, the two colors. I showed in 2D that finding such a line is easy, a solved problem. In high dimension it is not solved, because there are many, many hyperplanes, many lines, that can separate a few points from a few other points. So how do we solve this in practice? The standard state of the art is to regularize. Regularizing means writing an objective function: we do an optimization over the possible lines, looking for the β that minimizes a sum of two terms. I will not detail both terms yet, but the first term measures how well β separates the two classes. Someone observes that this already implicitly chooses a particular way of measuring separation, which is by no means a neutral fact; I agree, exactly. So this first term says: try to separate the data. And if you do only that, as I have said many times, it does not work, because the problem is ill-posed: there are many ways to achieve a perfect separation of these points in high dimension.
So the solution that machine learning and statistics has found is to penalize this with a second term which depends only on beta. This one does not depend on the data, right? It's some kind of prior penalty we put on any classifier, any line, and if we minimize the sum of the two, then, if we choose the penalty correctly, we end up with a well-posed optimization problem that has a unique solution and decides what the line is. Okay, this may be a bit abstract, so let's be concrete. What are these penalties, typically, which are implemented in all the software we use every day when we do these kinds of analyses? Maybe the two most standard penalties are just norms. So beta is a vector, and the standard penalties used are the Euclidean norm, squared if you want, or what's called the L1 norm. The Euclidean norm is based on the sum of the squared coefficients, and the L1 norm is the sum of the absolute values of the coefficients. So these are two norms over vectors. Here you can show them in the plane: if you have two dimensions, the light blue shapes here are the unit balls of these two norms. The L2 norm is just a circle, the L1 norm is more diamond shaped. And typically these two things are used as penalties for regularization, and if you do that, it has names: it's called ridge regression, lasso regression, support vector machines; all these things that people use everywhere involve these two penalties.
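The qualitative difference between the two penalties can be seen in a tiny numpy sketch. In the idealized case of an orthonormal design (an assumption made here purely for illustration, since then both penalized solutions have closed forms), the L1 penalty soft-thresholds the unpenalized coefficients while the squared L2 penalty only shrinks them:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 2000
# Unpenalized coefficient estimates: 3 real signals plus small noise.
ols = np.concatenate([np.array([2.0, -1.5, 1.0]),
                      0.1 * rng.standard_normal(p - 3)])

lam = 0.5
# Closed forms under an orthonormal design (illustrative assumption):
lasso = np.sign(ols) * np.maximum(np.abs(ols) - lam, 0.0)  # soft-threshold
ridge = ols / (1.0 + lam)                                  # uniform shrinkage

print("lasso non-zeros:", np.sum(lasso != 0))   # only the strong signals
print("ridge non-zeros:", np.sum(ridge != 0))   # everything stays non-zero
```

The L1 solution keeps only the coefficients above the threshold `lam`, which is exactly the "gene selection" behavior discussed next; the ridge solution shrinks every coefficient but zeroes none.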
Now what you see is that, and these are pretty generic, it's not about biology here, there is no notion of gene whatsoever, but these are already quite good at ensuring that in high dimension you have a well-posed problem that does something. Just let me comment on that. They have names: penalizing with the L1 norm is often called the lasso, and penalizing with the L2 norm is called ridge regression. Even though they look similar, they are just two norms, when you use them in practice they end up with very, very different models. In particular, for geometric reasons, it's known that if you penalize with the L1 norm, which is non-differentiable at some points, then the solution of the optimization, so your final model, will be sparse, meaning that the vector of weights will have many, many zeros, and you can control the number of zeros with the penalty parameter. Whereas if you use the L2 norm as penalty, it will not be sparse. So if you translate that in terms of genes, it means that the L1 penalty leads to gene selection, because at the end of the day your model, even though you fit it in 20,000 dimensions, will contain only a few non-zero components, which will be the selected genes; the L2 one will be non-zero everywhere, there is no gene selection. So, since there are probably other non-mathematicians in the audience, doesn't it worry you that you get very different results when you change this arbitrary, or semi-arbitrary, penalty function? I mean, if you converged to the same answer, you would say, well, the penalty doesn't matter so much. Yes, indeed. If we were in the easy situation with many samples in small dimension, what we would observe is that the choice of penalty is not important, and probably you would not even need a penalty. Something I didn't say is that there is a weight lambda that you fix, which balances how much penalty you put in compared to how much you believe the data.
So in easy situations, all the penalties converge to the same solution, which would be the one you get without penalties, and indeed, as you say, in the situation we are in with genomics, where we need to penalize, you get very different solutions. When you look at the vectors, they are very, very different, and there is no statistical way to say one is better than the others. And here we enter the field where we have to make bets or prior assumptions, and typically we assume that a good model should be sparse. There is an inevitable estimation error that grows with the number of parameters over the square root of n, so the error grows with the dimension. So I think that is a big problem with all these methods: they are, at heart, models, and sometimes they work miraculously well, but there is no guarantee that they should. Yeah, I mean, it's... So, okay, now another question is: in genomics, should we use the L1 or the L2 penalty? You just try both and choose the one that fits better, that separates better, yes? But none of them may work. I mean, this is the confusing point. You have to invent the right one. You take it from textbooks, you look at what people use. It's absolutely magic. Yeah, and in fact, I just put L1 and L2, but you can imagine any norm. It's arbitrary; you chose two more or less at random. You have to invent something better. Yes. So for example, what I want to illustrate is that, in addition, the reason why L1 or L2 or something else will work is sometimes very non-intuitive. Let's take the example of our breast cancer prognosis. We have breast cancer, we want to fit a model. I said you need to regularize. You take a textbook and it says, well, there are two nice ways: you can do lasso regression, you can do ridge regression, et cetera. And the lasso leads to a sparse solution, a signature of genes.
So I think there is a consensus among biologists and doctors that if there were a true model, like if we had enough data, probably the true model would be sparse, because we don't believe that all the genes carry predictive signal. It should contain a few things. This is a nice idea. So if you translate that into our inference problem, it would suggest that L1, if you had to choose, is maybe the better one, because L1 will lead to a sparse model and L2 will not. And that's what everybody did. I mentioned MammaPrint, the gene signature with 70 genes; it was not done with the lasso, it was an even simpler one-by-one selection, but it was based on the assumption that there is no need to have all the genes, a few should be enough. Now if you test that on data, surprisingly, and I will not explain all the details, you can do experiments where you plot some performance measure of your model. You have data, you train on a subset of the data, and then you test on another subset to evaluate the generalization performance. And we can compare models based on a few genes with models based on all the genes with ridge regression, and plot the performance as a function of how many genes you have in your model. So typically you balance lasso and ridge, and you say, I can make a signature with 70 genes, which is the standard in this field, and I get some performance. Or maybe I can add more genes, but then I need to regularize with L2, done correctly. And what you observe is that, in fact, in this case, the performance just increases with more genes. Right? And it's also a disappointing observation: focusing on 70-gene signatures gives you some signal, which is not too bad, but the performance is better if you keep all the genes. So this is counterintuitive, because if we had converged to a true model, myself, I don't believe we should need all the genes.
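A synthetic sketch of this phenomenon (my own toy simulation, not the actual breast cancer experiment): if the predictive signal is weak and spread over many genes, which is one possible explanation for what the talk observes, then a ridge model on all genes can beat a small signature on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test, p = 500, 2000, 1000
beta_true = 0.1 * rng.standard_normal(p)   # weak signal spread over ALL genes
X, Xt = rng.standard_normal((n, p)), rng.standard_normal((n_test, p))
y, yt = X @ beta_true, Xt @ beta_true

lam = 10.0
# Ridge on all genes, dual form (cheap when n << p).
beta_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# "Signature": keep only the 70 genes most correlated with the outcome,
# then fit a (ridge-stabilized) model on those 70 alone.
top = np.argsort(np.abs(X.T @ y))[-70:]
beta_sig = np.zeros(p)
beta_sig[top] = np.linalg.solve(X[:, top].T @ X[:, top] + lam * np.eye(70),
                                X[:, top].T @ y)

mse = {}
for name, b in [("70-gene signature", beta_sig), ("ridge, all genes", beta_ridge)]:
    mse[name] = np.mean((Xt @ b - yt) ** 2)   # test error
    print(name, round(mse[name], 3))
```

Under this diffuse-signal assumption the signature necessarily discards most of the signal, so the all-genes model generalizes better; when the true model really is concentrated on a few genes, the comparison can go the other way.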
But something happens here, which is just that, because we don't have many samples and because of many things we don't understand, empirically it is the other regularization that gives better performance. And probably this has to do with the fact that, as I said, it's very hard to be consistent in the gene selection. Sometimes when you pick 70 genes in some data, you end up with 70 other genes on other data. And this lack of robustness is probably related to the fact that the performance is not optimal, and that, because it's hard to choose the good genes, it is better to just keep all of them and apply a penalty that lets you learn in high dimension, even though the model is not sparse. One can see why: some genes may not look specific for cancer but may be involved, say, in ribosomes, and they may be measured very sensitively. The model may use a lot of genes not directly related but highly expressed, and they will give this effect, not because they are so important but because they are expressed in large quantities. Yeah, sure. But the only thing that we can test here, me at least, is the performance, and we just... Is it absolute expression, or relative compared to the healthy tissue? What do you put in? If you just put in normalized numbers, it completely changes the result. In this case, these are absolute expressions, even though absolute means they have been processed by some preprocessing... Normalized by the control, right? Yeah. I need to intervene: from now on, we'll ask only short questions, okay? Otherwise we will never finish. Short questions. If I went into your data and mislabeled a few patients, wouldn't that... So you're trying to learn from data with a little bit of noise in the patient labels. Yeah. Wouldn't that... You mean the performance would decrease, or... Yeah, sure. So, I mean, noise in the labels in general is something...
I mean, all the methods that we work with are supposed to learn with noise in the data. So in a sense, the picture I showed with the black and the white points was not the correct one. Here, if I come back here, we balance a sum of two terms: one is how well we fit the data, and the second is how well we agree with our prior knowledge or prior hypothesis. And here, the fit to the data does not require that you make no error. It says, well, we can accept errors, and sometimes it's better to make a few errors on the training set if we get a better penalty. I'm not talking about the errors of your algorithm... It's a long question already, ask him later. Yeah, it's already long. You're trying to fit the mistakes. Yeah. Now, I don't say it's good to fit the mistakes, but we all know that the mistakes are part of what I call the noise in the training data. But we can follow up over coffee. Okay, so I understand I have to hurry a bit, and I will take no more long questions. What I wanted to say is that something happens here; it's sometimes counterintuitive which penalty works or doesn't work. And so some people, our group and many others, have said that maybe, if we are confronted with genomic data about which we know a lot of things (we know the genes, the ribosomes, the cell cycle, et cetera), it's possible to use this prior knowledge to design different penalties, a bit less naive than these, because the L1 or L2 ball, with gene 1, gene 2, is isotropic in all directions. And so things that we and others have proposed are, for example: if you know a gene network, if as prior knowledge you know that some genes... Protein networks? This one is a picture of a protein network, yes, but for my purpose I just say that you have genes and connections, which could be physical interactions, pathways, many things.
Could this knowledge be put in the prior, so that you can drive the selection of models to be compatible or coherent with what you believe happens in the cell? If you're correct, this may be a way to help the inference process, right? In a sense, you reduce dimension by focusing on what you want. So say you have a graph and you want to use it to constrain your model. How can you do that? Well, here are a few examples of penalties where the structure of the graph enters the definition of the penalty. All of these are functions of beta, so beta would be a candidate model, and for each candidate model you can quantify something. These are meant to replace the L1 or L2 norms, right? And then you can put these penalties in the optimization, and this gives the inference model: at the end of the day, I try to fit a model that explains the data and is small in terms of these penalties. So what are these penalties? I won't have time to detail all of them; maybe I will just illustrate the weirdest one. I know there is a heterogeneous level of mathematics in the room, but for many people this one looks a bit strange. Given a graph, for a vector beta, the penalty, a norm of beta, would be the supremum of alpha transpose beta over all vectors alpha such that alpha i squared plus alpha j squared is at most one for every pair of connected genes i and j. What is that? It's a bit ugly; it doesn't seem to make sense for many people. But in fact it can be illustrated as follows. This is the penalty I was referring to; it's just a variational form of a norm whose unit ball is a convex hull in high dimension. So what do I mean? Imagine, here we live in the space whose dimension is the number of genes. So it's not two, it's 20,000.
And what we do is design a shape in 20,000 dimensions, in this case just by using the gene network: you take all pairs of genes which are connected. For example, suppose you have five genes here. You take the pair one-four; these are two genes, and they are connected. What you do in this case is that, in the high-dimensional space, you draw a circle with unit radius in the subspace corresponding to dimensions one and four. Right? You are in high dimension, but you focus on just two dimensions: dimensions one and four, you draw a circle. And then you do that for all the pairs: two-four, you draw another circle; four-three, you draw another circle. And you end up with many two-dimensional circles. Then you take the convex hull, and the convex hull is just the smallest convex shape that contains the circles. This picture shows what would happen with two circles in 3D: a horizontal circle, a vertical circle, and the convex hull is like a Chinese lantern that fits the circles, but no more. Right? If you do that, this defines a norm: you can take this shape as the unit ball of a norm, and the equation before is just the definition of that norm. And now you can use this norm in place of the L1 norm or the L2 norm that I mentioned. If you do that and analyze the situation, then, because the shape is non-differentiable at some specific places, on the circles, you can show that it leads to a selection of genes. When you minimize, you estimate a model beta, and the solution beta will have many zeros, just like the lasso. But among the non-zeros, the genes which are selected, you will see that many of them are connected on the graph, the reason being that the solution tends to lie on some of the circles, which corresponds to two connected genes being non-zero. Right?
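The variational definition can be checked numerically on a tiny example. This is my own sketch of the definition just given (evaluating the supremum with a generic constrained optimizer, not code from the talk): Omega(beta) is the supremum of alpha dot beta over all alpha with alpha_i^2 + alpha_j^2 <= 1 for every edge (i, j).

```python
import numpy as np
from scipy.optimize import minimize

def graph_norm(beta, edges):
    """Omega(beta) = sup { a.beta : a_i^2 + a_j^2 <= 1 for every edge (i,j) }."""
    cons = [{"type": "ineq",
             "fun": lambda a, i=i, j=j: 1.0 - a[i] ** 2 - a[j] ** 2}
            for i, j in edges]
    res = minimize(lambda a: -(a @ beta), x0=np.zeros(len(beta)),
                   constraints=cons, method="SLSQP")
    return -res.fun

# Sanity check 1: a single edge gives back the Euclidean norm on that pair.
print(graph_norm(np.array([3.0, 4.0]), [(0, 1)]))        # ~ 5.0

# Sanity check 2: two disjoint edges split into per-edge Euclidean norms.
beta = np.array([3.0, 4.0, 1.0, 0.0])
print(graph_norm(beta, [(0, 1), (2, 3)]))                # ~ 5.0 + 1.0 = 6.0
```

These checks match the geometric picture: on one edge the unit ball restricted to that plane is exactly the circle, so the norm is Euclidean there.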
So this is just a way to change the penalty a bit by putting in prior knowledge, pushing the solution towards not just any selection of 70 genes, but hopefully a selection of 70 genes that tend to be connected on our network. Right? We change things a little. Couldn't hub genes influence this a lot? Because they have these huge numbers of connections, they may have much more influence on the result. Yeah, that's a very good point. Here, a variant would be to use weights, et cetera, but this is a very important question for which we don't have any satisfactory answer. Just to illustrate, let's take back our breast cancer data and compare what the signatures look like. So here is a selection of, I think, 60 genes obtained with the lasso, a standard, state-of-the-art method: I start from all the genes, I select 60 genes. Here they are. And if you map them back to a network to try to analyze the signature, you observe that a few of them tend to be connected. Here you have your ribosomal proteins, by the way, showing very strongly as six connected genes, so you can start to get some interpretation: it seems that ribosomes may be involved, et cetera. But many other genes are not connected, so it's a bit harder to interpret. That gives a performance of 61. Now if you change only the norm, from L1 to this convex shape, you get a different signature. And it's not a surprise, but of course the signature is more connected, because you have put the knowledge of the graph into the penalty. So these are the 60 other genes, and suddenly you have bigger connected components: the ribosomal proteins are extended to a bigger subnetwork or pathway, a second big component appears, which are cell cycle genes, and a few other ones, right?
So here it's hard to say it's better, because you put in the knowledge of the graph, and therefore it's mathematically obvious that you get more connections. That's what you designed, so it's hard to claim it's really good news that you observe it; you had to observe it. What we can observe is that sometimes, it's hard, but sometimes the performance slightly increases. Here I don't show the error bars, et cetera, but you tend to see a slight increase in performance, which may be a good sign that choosing different penalties can have some impact, and in particular that using some prior knowledge may drive you to more realistic models, which may be more stable and also lead to better performance. All right, there are other penalties, but I will not discuss them, because I want to say one word about the second part. I said there are two parts in my talk. One is regularization; I hope I explained what it means and that there is the possibility to define new penalties. The second is to say, maybe we are just naive when, because someone measured 20,000 numbers, we put them directly into our model, right? Because, as you said, there is a very strong assumption, and some strong naivety, in saying these are numbers, therefore these are vectors, and believing that this is a good geometry for the problem. Well, the reason is obvious, right? The reason is obvious; otherwise you lose the motivation. Yeah, yeah, that's right. But let me illustrate on one concrete case why we know this is very naive. I have been showing this image many times, saying these are real data, gene expression data. But in fact these are not the data that come out of the machine, for example for microarrays, and it's the same for sequencing.
What comes out of the machine is first pre-processed in many ways, because we know there are technical effects, batch effects, differences between sequencing runs, et cetera. So before arriving at this matrix, in fact, a lot of things happen. One of the most standard pre-processing steps, for people working with expression microarrays it's implemented in the RMA package in R, one of the most popular, comes from the observation that if you do control experiments where you measure gene expression on the same sample many times, on different days of the week, with different people, you get different numbers, right? For many reasons; it's very hard to control. These are unwanted variations, variations due to technical effects that you don't believe are biology. So to remove these effects, someone has to do some normalization as pre-processing. And one thing which is typically done is called quantile normalization. It's a transform where, imagine that each box plot here is one sample, the distribution of gene expression over one sample. You don't keep the raw values; you normalize them so that, at the end of the day, when you have many samples, they all have more or less the same set of values. Not the same genes in the same order, but if you look at the distribution of values within each sample, it ends up being the same. This is called quantile normalization, right? Now, what this says is that, in fact, what you kept from the original data, the original signal, is not really the values; it's just the relative order of the genes.
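The transform just described can be sketched in a few lines (a minimal version of quantile normalization, not the full RMA pipeline, which also does background correction and summarization):

```python
import numpy as np

def quantile_normalize(X):
    """Force every column (sample) of X to share the same value distribution.

    Each sample keeps only the relative order of its genes; the values
    themselves are replaced by the mean quantiles across samples.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each gene
    target = np.sort(X, axis=0).mean(axis=1)           # reference distribution
    return target[ranks]

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])   # rows = genes, columns = samples
Xn = quantile_normalize(X)
print(np.sort(Xn, axis=0))        # identical sorted values in every sample
```

After the transform, within-sample gene orderings are preserved, but every sample has exactly the same set of values; only the ranking survives, which is the point made next.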
Because when you move from the raw data to the data you will use in your models, only the order has been kept, and from the order you define new values that are constrained to follow a particular distribution. Right? So what can you say from that? A few things. First, and again I will be short now, one question is: here I say you transform your data so that in the end they all have the same distribution, but it's not clear what that final distribution should be. It's a bit arbitrary, right? You could impose a Gaussian distribution, a uniform distribution, an empirical average; there are many possible choices, and no clear reason to pick one over the other. So, these are different possible distributions, and something we worked on with a student of mine is the idea that this target distribution could itself be optimized. I don't have time to explain, and don't look at the equations, but it's possible mathematically not only to say, I first transform and then I fit a model, but to say, I transform and I fit a model, and I optimize both the transformation, in terms of the target distribution, and the vector beta that defines the model, right? So in short, it's a joint optimization over a model, sorry, I changed notation from beta to W, but it's just a linear model, and the parameter of the transformation, okay? So it's possible to do. And the interesting byproduct is that when you do that, you need to clarify the link between the target distribution and your optimization problem. And in fact there is a simple link, which goes through a first representation of each sample. So imagine this is one patient, one vector of expression. As I said, the information you keep from this sample is not the values; in fact, you will change the values. It's the relative order of the genes, what's called a permutation in mathematics.
You know, a ranking of the genes. And in our case, the way we represent this is through what's called a permutation matrix. It's a binary matrix that indicates the position of each gene: each column corresponds to a gene and each row corresponds to a rank. And you can show that if you replace a vector of expression by this matrix, then this corresponds exactly, sorry, I skip the details, to optimizing both the model and the target distribution, right? I don't want to go into the details, but for me the important lesson here, and somehow it works, is that to make this work, what we have done, and what we often do without realizing it, is that when we use gene expression data, we don't really use the values that were measured. We first transform gene expression into a permutation. So we move to a discrete setting: the space of permutations is called the symmetric group. And then when you have, say, a thousand samples, it means you have a thousand permutations, and you learn from that. It's a new representation. One way to proceed is what I presented: from the permutation, we map back to the space of vectors and fit a linear model. But maybe you don't have to do that. Maybe you can directly design algorithms or methods that work on the symmetric group. So it raises the question: can you learn on the symmetric group? And in fact many people do that and are excited about it, right? So in short, even though we believe we work with vectors of measured numbers, in fact there is something in between that happens in many cases, which is working on the symmetric group. And just as an idea, maybe it's possible to start directly from there and design new approaches.
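A tiny sketch of this representation, to make the idea concrete (an illustration of the encoding described, not the full joint-optimization method from the talk):

```python
import numpy as np

def to_permutation_matrix(x):
    """Encode a sample by the relative order of its genes only.

    Entry [r, g] is 1 if gene g has rank r (0 = smallest expression).
    """
    order = np.argsort(x)                 # genes sorted by expression
    P = np.zeros((len(x), len(x)))
    P[np.arange(len(x)), order] = 1.0
    return P

x = np.array([0.3, 2.1, 1.4])             # one expression profile, 3 genes
P = to_permutation_matrix(x)
print(P)
# Any monotone rescaling of x (e.g. a quantile normalization) yields the
# same matrix: only the ordering of the genes survives the encoding.
assert np.array_equal(P, to_permutation_matrix(np.log1p(x)))
```

The final assertion is the key property: two measurement pipelines that agree on gene rankings map a sample to the same point of the symmetric group.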
So here the question is: suppose you want a linear model, you need to define what a linear model on the symmetric group is. Of course, mathematics has a lot of tools for that, but it's not a completely solved problem. Isn't it like a sorting problem? Maybe it can be formalized that way, but it's not quite that. Okay, so I think I will just stop here, because I still have a lot of material; I could talk more, but obviously I don't have the time. Just as a teaser, the last thing I want to mention quickly is that I talked a lot about gene expression data, but there are many other data. For example, we can look at these types of data, which are more discrete and which, again for cancer, so here this is a picture on breast cancer, indicate not the expression of the genes but the mutations in DNA, right? You could say, well, now I look at the genome of the tumor, I compare the tumor to the normal sample, take the difference, and I observe that some genes are mutated in the tumor; so maybe just looking at which genes are mutated for a given patient can help me predict the risk of relapse or the response to a therapy. But here again we have the same problem: we want to use this as input to fit a model, and it's again high-dimensional. And there is something more here, which you can see just from this picture: these are not just random vectors in high dimension, it's a binary matrix with 99 percent zeros. When you take two samples at random, they are basically orthogonal; they have no mutated gene in common, maybe one, P53, but overall it's very hard. So again there is a very strange or complicated geometric structure, which means that if you directly take this and fit a model, Cox regression if you want to predict survival, or something else, it just doesn't work well.
And so some people, we and others, with Andrei Zinovyev in particular, have investigated whether it's possible to change the representation: instead of a big binary vector, represent the sample by something else. We proposed something using the network again, I don't have time to explain, but to make a long story short, it's possible to replace the original representation by another one such that, in the end, and maybe this is the only thing I can show, you can test empirically, when you have different ways to represent your data and you fit a model, how well each works. Overall, across different cancers, the performance is not great, but the only thing I want to show is that in some cases, just changing the representation from the raw data to another one has an impact on the performance. So it's the same data, the same learning model; the only thing we change is whether we start from a big binary vector like this, or from a transformation, et cetera. And sometimes there is a significant difference in how well it works, right? So this is also a domain in high dimension with few samples, where your initial choices of representation or regularization matter: they change the final performance of your model, so it's really worth thinking about. I don't have any secret; it's all very empirical and based on assumptions we try, but I think there is still a lot of room for improvement in how to represent a profile of mutations, for example. All right, I will stop here, and I'm really sorry to be so long, but in conclusion, my message is just that there are challenges in trying to extract knowledge from genomic data, and many of these challenges are not biological or medical; they are mathematical and statistical.
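The talk does not detail the representation used, but one family of approaches in this spirit is network propagation: diffuse each patient's sparse binary mutation profile over the gene network, so that two patients with mutations in neighboring genes are no longer orthogonal. A minimal sketch (the update rule and parameters here are illustrative assumptions, not the method from the talk):

```python
import numpy as np

def smooth_mutations(x, A, alpha=0.5, n_iter=30):
    """Diffuse a sparse 0/1 mutation profile x over a gene network.

    A is the adjacency matrix of the gene graph. Random-walk smoothing:
    f <- alpha * W f + (1 - alpha) * x, with W the column-normalized adjacency.
    """
    W = A / np.maximum(A.sum(axis=0, keepdims=True), 1)
    f = x.astype(float)
    for _ in range(n_iter):
        f = alpha * (W @ f) + (1 - alpha) * x
    return f

# Tiny 4-gene chain 0 - 1 - 2 - 3, with a mutation only in gene 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
x = np.array([1, 0, 0, 0])
f = smooth_mutations(x, A)
print(f)   # mass leaks from gene 0 to its network neighbours
```

After smoothing, a patient mutated in gene 0 and a patient mutated in gene 1 have overlapping profiles even though their raw binary vectors were orthogonal, which is exactly the geometric problem described above.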
I mentioned regularization and representation as two approaches, not independent of course, to try to do something, and more importantly, I think there is a really subtle interplay between knowledge of the biology on the one hand and how we can put it into the mathematical or computational framework on the other, and the intuition for why one approach works or not is often a mix of biological reasons and purely mathematical reasons. Thank you. We have time for one or two questions, that's it, okay? So for biologists it's obviously very worrisome that the mathematics you do with the data gets you different answers, and the question I'd like to ask, and I'm not sure there's an answer, is: is the reason the processing, the mathematics, is so critical that the quality of the data isn't good, or that the complexity of the problem is very high? Well, I guess it's a mixture of both. I insisted not on the quality of the data; I insisted that, even if the data were perfect, you are in too high a dimension with just a few samples, or equivalently, and this corresponds to your second hypothesis, the model is too hard to learn, meaning that learning even a linear model in high dimension is hard, independent of the noise. Now, the fact that you have noise doesn't help. For me the problems are additive: having noise makes the problem hard, and being in high dimension makes the problem hard, and the difficulty is somehow the sum of the two. So it's easier if you have less noisy data, and if you have more data in lower dimension. So I happen to routinely deal with lasso models, and my experience is that when prediction doesn't work, you look at which cases are hard to predict. There are usually a few bad apples; you throw those out, everything works beautifully, and then if you zoom into those bad apples, you can justify it. Usually it's an obvious error of class label.
This guy actually died and didn't survive. I'm just wondering if in your experience there's something similar. Yeah, I would say this happens all the time, and I completely agree with you; in everything I presented, in fact, I didn't talk about some other preprocessing, typically removing outliers, this kind of thing. Now, this being said, I don't really agree, I mean you didn't claim that, that once you do this the problem is solved. Then we enter the world of the other problem, which is that, even if there are no outliers and no noise in the data, we still have to learn a big model, and the lasso in high dimension is not robust, for example. If you select the genes, they will just not be the same, even if you remove the outliers. But I fully agree with you that, practically speaking, it is very important not to blindly apply these techniques to data, but to check what the data are. You always do a PCA, you visualize your data. Okay, thank you. Thank you. Thank you. Have a good day.