Good morning, and welcome to this session about machine learning. Tomorrow there will be more detailed machine learning again; my job today is to introduce you to the ideas listed in the outline. Is there someone here who already knows a bit about machine learning? Ah, a few. So, my name is François Laviolette. I am one of the four organizers of the summer school. I am the director of the Big Data Research Center, and also a member of the GRAAL group — the French acronym of our machine learning research group, which also spells "Graal", the Holy Grail. That is the joke: the quest for the Holy Grail, here in our lab.

Okay, so let's start by explaining what Big Data is, because most people have heard about Big Data but don't really know what it is about. We will see what the problem with Big Data is, and then we will focus on machine learning. My favourite definition of Big Data is the "4 V" definition. What makes Big Data, first of all? Volume. You already know the small units, because of the memory of your computer: the memory of your hard disk is gigabytes, maybe a terabyte if you really like videos. Beyond that come sizes you never see: peta, exa, zetta, yotta. Each time you multiply by 1000 — in fact by 1024, because it has to be in base 2, binary. Most of the time, when I talk about this, people don't really grasp how big those things are, so here is something I found on the internet: the allegory of the grain of rice. Suppose that a byte — basically a sequence of 8 zeros and ones — takes the place of a grain of rice. (It is of course much, much smaller on your hard disk, but suppose.) Then what is a kilobyte? A cup of rice. What is a megabyte? Eight big bags of rice. A gigabyte? Three big trucks of rice. A terabyte? Two big container ships full of rice — hopefully not on fire. A petabyte? You cover Manhattan with rice. An exabyte? You cover the West Coast of the United States. A zettabyte? You fill the Pacific Ocean with rice. And a yottabyte is basically a ball of rice the size of the Earth. The companies dealing with zettabytes — with the Pacific Ocean, so to speak — are Google, eBay, Amazon; Facebook is not so far from a zettabyte. And this explains why they are all on the West Coast: because of the ocean...

So volume is one part of the problem, but not the only one. The second V is velocity. Velocity means that data now arrives continuously — gathered by sensors, by high-throughput technologies — always streaming into your storage. You have to deal with this constant arrival of new data: take it into account as it comes, process it automatically, and put it somewhere where you will be able to find it again if you need it. The third V is variety. We now have data that comes from text, video, the internet of things, omics data; all these things were never meant to talk to each other. How do you make this variety intelligible? That is the third V of the big data problem. And the fourth V is veracity.
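To make the ladder concrete, here is a throwaway sketch of my own — not from the talk — that just prints the powers of 1024 behind those unit names:

```python
# Illustrative only: the byte-unit ladder, each step multiplying by 1024.
units = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
for n, unit in enumerate(units, start=1):
    print(f"1 {unit}byte = 1024^{n} bytes = {1024 ** n:.3e} bytes")
```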
You have high-throughput technologies, yes, but the data they produce is much noisier. You want to look at what happens on social networks? There is what is called... fake news — thank you. So you also have to deal with the trustworthiness of the data. If you deal with those problems — not necessarily all four of them — you are dealing with a big data problem. That means you are basically already dealing with big data when you work with microarray and omics data, because of the variety of the data and the fact that it is very noisy. But more precisely, you are dealing with what we call fat data rather than big data, and this is a very, very complicated issue for data analysis. In ordinary big data you have millions of examples, each characterized by a few features, so machine learning calculations are easy and a few errors here and there do not matter much. Fat data is exactly the opposite: you have a few examples — a few patients — on which you have the whole genome, plus RNA data, plus the clinical data, plus the microbiome, plus, plus, plus. So you know a lot of things about each patient, but you have very few patients. And you want to do statistical learning — machine learning is based on statistics. So this is really the big problem that we will try to address today. In life science, fat data is an issue; I don't have to convince you of that.

Okay, now, machine learning. This is Drew Conway's Venn diagram, which has been around for about ten years and explains in some way the new reality of big data. In the traditional situation, you have the domain expertise: people working on genomics, for example, run an experiment and then go to see a statistician or a mathematician to construct a model that will explain their phenomenon. What happens with big data? The problem becomes computational: the data is too fat, there are too many things to observe and to compute. So you get this computation problem, and this is where machine learning starts to be interesting. I will say something about the danger zone of the diagram later; just a cartoon for now. We found a correlation in the data: everyone who shaved his head had an increase in his sales. So the boss says: everybody, take a razor! This is the kind of false decision you can make if you are not aware of the danger zone.

But let me now try to give you some insight into what machine learning is. I am not claiming you will be autonomous in machine learning after my talk, but you will understand the principles, what can be done, and where the danger zones are. So: machine learning 101. First, a definition — I like one of the oldest, usually attributed to Arthur Samuel: machine learning is the field of study that gives computers the ability to learn without being explicitly programmed. The idea is that in machine learning we learn from examples: we give examples to the computer and we let the computer infer the solution. For instance, suppose you have reviews of films from Amazon, and you want to know whether those reviews are positive or negative, because you don't want to watch a film with negative reviews. How can you construct a predictor that extracts the sentiment of a text? For a computer, a text is just a sequence of letters; it carries no interpretation the way it does for us.
So here is the trick: you give a human a lot of example texts from Amazon and you ask him to label each one +1 if it is positive and −1 if it is negative. Then you give all this data to the learning algorithm. The learning algorithm constructs a function — we call it a classifier — that takes a new text, never seen by the human expert, and infers, correctly most of the time, whether it is positive or negative. More precisely: you have training data, which can be text, images, omics data, mass spectra, a mix of them, anything. The first thing we do in machine learning is convert this data, in some way, into a vector. For example, for text, you can take the English dictionary: the first word is "a", and you count the number of times "a" appears in your text — that is the value of the first coordinate of your vector. Then you take the second word of the dictionary and count again, and so on. This "bag of words" is a pretty crude way to turn a text into a vector — there are better ways — but it gives you the idea. Then we ask a human expert, or find some other way, to obtain a label for each training example. You give everything to the algorithm, the algorithm outputs a predictor, and on any new example of this kind of data the predictor will be correct most of the time. That is the whole idea. Of course, you can predict a phenotype the same way: you have a batch of patients with a disease and a control group, you gather the same kind of data on the two groups, and the learning algorithm will find a predictor — if you have sufficiently many labelled examples, this will work.
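Here is the whole pipeline in miniature — a minimal sketch assuming scikit-learn, with made-up reviews and labels rather than anything from the actual talk:

```python
# Sentiment classification in miniature (illustrative sketch).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up training set: +1 = positive review, -1 = negative review.
texts = ["a wonderful film, I loved it",
         "terrible plot and bad acting",
         "great movie, great cast",
         "boring, I want my money back"]
labels = [+1, -1, +1, -1]

# "Bag of words": one coordinate per dictionary word, counting its occurrences.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = LinearSVC().fit(X, labels)   # the learning algorithm builds the predictor

# The predictor labels a review the human expert never saw.
print(clf.predict(vectorizer.transform(["what a great film"])))
```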
There are many different kinds of labels we can consider in machine learning. The one we focus on today is binary, +1 / −1 — for example because you are a control or because you are a patient. But you can also have multi-class classification: you have texts and you want to decide whether each is about sports, politics, economics, and so on — many classes. The label can be a real value; then we call it regression — you want to predict the weather, say. (By the way, summer officially begins during my talk, so we are supposed to have nice weather afterwards.) The label can even be a very complex object: you may want to predict, for example, a peptide, or a sentence in English — in a question-answering system, you ask a question in English and the learning algorithm gives you the answer in English. That is structured prediction. We will not do that today; there is still a lot to write on the subject. So, in classification, I recall that you always have to represent your examples in a vector space. Here the vector space is R2 — the first coordinate is the expression value of gene G1, the second the value of gene G2 — and that gives you a point for each example. You have examples that are controls and examples that are patients, and you want to find a separation between the two groups. In regression, you have a real value to predict, and you want a function that predicts it as well as possible.

Okay. What exactly should the task of machine learning be? We want a predictor that outputs +1 or −1, or the regression value, not on the examples we have already seen, but on the examples to come. So what should we aim for? Of course, we have to make few errors on the examples given to us for learning. But if we focus only on that, we will overfit the data — and this is particularly true in a fat data situation. For example — the colours on the slide are not very good — there is a bunch of red points and a bunch of blue points, and you want a predictor that decides in which areas to answer blue and in which to answer red. Here is a predictor that makes no error at all on the training set — believe me, even if the colours are bad, there is no error. From the point of view of the training examples, it is perfect. But if I give it a new example here, my predictor will say blue, and it will probably be wrong. What is the problem? It focused too much on what it has seen. It would probably be better to accept a few errors on the blue points and have a much simpler function. We call this Occam's razor: if a simple thing works well, it is better than a very complicated one — and there is theory behind that, with guarantees.

You can also see this overfitting/underfitting trade-off in curve fitting — this is not machine learning, just plain math. You want to fit a curve to your data points. If you consider only lines as possible functions, this line is the best one, and it is not good: it is poor on the points you have seen and it will be poor on new points. On the other side, if you take, say, a polynomial of degree 15, you go exactly through your training points; but if this smooth curve is the true rule you are trying to fit, then on new points here and here you will be bad. So with one curve you underfit; with the other you overfit — on the examples you have seen you look good, on the examples to come you are not. The whole trick of machine learning is to find ways to produce the middle curve: not underfitting, not overfitting. And unfortunately we cannot see the true line — it is unknown. That is the whole problem. Everything clear so far? Good.
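The under/overfitting picture is easy to reproduce numerically; here is a small sketch of my own with numpy, using degrees 1 and 15 as in the talk and an invented "true rule":

```python
# Under- vs over-fitting polynomials on noisy points (illustrative).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y_true = np.sin(2 * np.pi * x)                  # the unknown "true line"
y = y_true + rng.normal(0, 0.2, size=x.shape)   # noisy training points

for degree in (1, 15, 4):                       # underfit, overfit, in between
    coeffs = np.polyfit(x, y, degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    # Error against the true rule on fresh points the training never sees:
    x_new = np.linspace(0, 1, 200)
    test_err = np.mean((np.polyval(coeffs, x_new) - np.sin(2 * np.pi * x_new)) ** 2)
    print(f"degree {degree:2d}: training error {train_err:.3f}, true error {test_err:.3f}")
```

Typically the degree-1 fit is bad everywhere, the degree-15 fit has near-zero training error but a large true error, and the middle degree wins on the points to come.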
Okay. Now I would like to present some learning algorithms. I won't go too deep into the details, but I think it is important to see how they work and what kind of predictor they construct. I am not sure you will use them by yourself at first — you can always ask a machine learning specialist to help you — but understanding roughly how they work is a good starting point. So: what is a neural network? Everybody has heard the words; now you will see what they mean. A neural network is simply the following. Remember that I said you first have to transform your data into a vector of dimension d. In a neural network, the first layer is simply that input vector — not more, not less.

What you are trying to do, when you learn a predictor from the neural network point of view, is to learn how to transform this vector into a new vector that is a new representation of your sample. If it works correctly, the new representation will be better adapted to the task you are interested in. I am not telling you yet how to do it, but the job of the neural network is to find a new representation that is much closer to your task. For example: this is an image, and you want a representation that helps you decide whether there is a cat or a dog in it. If you do it properly — if the representation is suitable — each output neuron carries a positive or a negative value: positive means it leans "cat", negative means it leans "dog". And then the idea is that you can safely take a majority vote of what the output neurons think: if the majority thinks dog, you predict dog; if the majority thinks cat, you predict cat. It is a majority vote that is not democratic: you put more weight on the neurons that you trust more.

And how do you learn those weights — the matrix W that makes the new representation, and the vote V? The trick is relatively simple. You take the first example of your training data. You initialize W as, say, the transformation where every weight is one, and V as the democratic vote, everyone at one. You push the example through W to the output layer and take your majority vote. You look: your prediction is "cat", and the true label in the training set is "dog" — your neural network made an error. So you do what is called back-propagation: by gradient descent, with a bit of mathematical machinery, you readjust the weights W and V. Then you take the second example; it is predicted correctly, so you leave the weights alone. Then the third example; if it is wrong, you adjust V and W again. And you repeat that, millions of times, until you reach some kind of convergence. That's it, that's all.

Yes? [Question: don't you also have to split between training and testing?] For now I am only on the training set; I will explain at the end where the test set comes in. I have a training set, and I have put a test set aside, but here I work only with the training set, in order to find the values of W that give me a suitable output layer. Yes? [Question: how do you know when to stop?] There are many variants. You can keep some kind of validation set — not what you use for training, a sort of intermediate test set between training and test — and watch how you improve on it; there are early-stopping strategies based on that. And there are many other choices: the number of neurons, the number of weights of W that you allow to be non-zero, and so on. There are a lot of ways to construct different neural networks. In fact, if you like LEGO, you will like neural networks: you do a lot of construction, exactly as when you were young, building a little birdhouse out of blocks.
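Here is a compact numpy sketch of that W/V training loop — my own toy version, with random initialization (which is what is done in practice, rather than all-ones) and an invented two-dimensional task:

```python
# A one-hidden-layer neural network trained by gradient descent (illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # toy inputs, dimension d = 2
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)   # made-up +1/-1 labels

n_hidden = 8
W = rng.normal(scale=0.5, size=(n_hidden, 2))    # input -> new representation
v = rng.normal(scale=0.5, size=n_hidden)         # the non-democratic vote
lr = 0.01

for epoch in range(300):
    for x_i, y_i in zip(X, y):
        h = np.tanh(W @ x_i)          # transformed representation of the sample
        out = v @ h                   # weighted vote of the neurons
        err = out - y_i               # compare the prediction with the true label
        # "Back-propagation": nudge v and W to reduce the squared error.
        grad_v = 2 * err * h
        grad_W = np.outer(2 * err * v * (1 - h ** 2), x_i)
        v -= lr * grad_v
        W -= lr * grad_W

accuracy = np.mean(np.sign(np.tanh(X @ W.T) @ v) == y)
print(f"training accuracy: {accuracy:.2f}")
```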
It is not so easy to manipulate, but it is not so difficult either. Everything fine with the neural network? Yes? [Question: what scale of data is needed to actually make this work?] Good question. With millions of examples we are in very comfortable territory with neural networks. Below that, we have to be careful, but we can manage. With fat data, it is a niche: a lot of people are working on making deep learning usable in health science, but it is not something that is really done yet. Still, since everybody has heard of deep learning, I felt it had to be the first algorithm I showed you. So just remember: we try to learn a new representation of the data that is more suitable for the task, and we do it by observing the data and adjusting the W and V weights.

[Question: should we always use the same number of cases as controls?] It is always better if the classes are balanced, but it is not always the case. In an unbalanced situation — lots of controls and very few patients with a rare disease, say — there are techniques to adjust. But it is always more difficult. In particular, you can no longer just look at accuracy. Suppose 99.9% of your data is +1 and the rest is −1: the best trivial predictor always says +1. It is extremely "accurate" — only 0.1% error — and completely useless. So your evaluation metrics have to take this into account: it is not only the number of errors that matters, but also the balance of the data, and so on. Explaining all of machine learning could take a year; I have an hour and a half!

There is one little thing I did not say. When you make the transformation, you take the value of x1 times the weight on its arrow, plus the value of x2 times the weight on its arrow, and so on, and this gives you the neuron's value. But you do not leave the neuron like that: you pass it through some function that is not linear. You have to combine the linear transformation with a non-linear function, and the reason is essential for deep learning: if you don't, then V and W collapse into a single linear vote that goes straight from input to output — a majority vote of a majority vote is just a majority vote. So you need a non-linear break between the linear transformation here and the linear transformation there: you take the value obtained and apply, for example, a softmax, or any non-linear function. This is what allows you to have many layers — and deep learning is simply a neural network with many layers. (See the little demonstration below.)
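That "linear composed with linear is still linear" point is a one-line check; a tiny demonstration of my own:

```python
# Why the non-linearity is needed: two stacked linear layers = one linear layer.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 5))    # first linear layer (input -> hidden)
V = rng.normal(size=(1, 8))    # second linear layer (hidden -> output)
x = rng.normal(size=5)

two_layers = V @ (W @ x)       # no activation function in between
one_layer = (V @ W) @ x        # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))   # True: the depth bought nothing
```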
I told you about representation; here is the first problem that was solved this way — it was in 1991, I think, by Yann LeCun: recognizing handwritten digits. The raw matrix of pixels is not an interesting representation for this task. The algorithm found — the algorithm, not me — that what sits in the middle of the image is what really determines which digit it is, and that a good representation of a digit is not the raw pixels but something more abstract. Why? Think about it: if you take your digit and translate it a little, the two pixel matrices are very different, but they are both the same digit. So in a suitable representation they must not be far apart, because you may just have translated, or slightly rotated, the white pixels. You can also see that we start with a black-and-white matrix and end up with grayscale matrices there.

Here is another situation — a convolutional deep learning mechanism, for those who are into this — where the input is for facial recognition. The input is the matrix of pixels, that's all, and then many successive layers are learned. And what they found — I suspect the deep learning people worked hard to find this example, because it is beautiful to see; I am not sure it comes out like this in every problem — is that in the first learned layer, the face is transformed: the algorithm looks for what we call in French "les traits", the lines of your face. As you climb through the layers, the algorithm identifies eyes, noses, ears: a person becomes a combination of these eyes plus this nose plus this ear, and so on. And at the last level, the one used to make the decision, a person is represented as a combination of artificial faces. This is why it is robust: if she starts to smile, or just turns her head, the pixel matrices are completely different, but they are still basically the same combination of those artificial faces. That is the strength of deep learning: it learns representations.

And here is a domain where deep learning is very, very useful: imaging. I remind you that for a computer, an image is a matrix of pixels. And for a human, what is it? A thousand words — a picture is worth a thousand words. So how can we transform this matrix of pixels into a story? That is exactly what machine learning is about: taking unstructured information and starting to structure it. So: this is a car, this is the road, this is the sky. Of course, it makes some errors here and there. It takes information that is not structured for what we want to do with it, and structures it in a more intelligible way. And look at what can be done now — this is quite impressive; it is from Stanford. Automatically, the predictor says: "a woman in a white dress standing with a tennis racket, with two people in green". There is no human intervention: you give it the picture, it gives you the story. Another one: "a dog playing catch with a white ball near a wooden fence". Okay — there are some millions of training examples behind that, I must say. For an example like that to work, the system was fed millions of pictures of dogs, of balls, of fences, which had been labelled manually by humans. In fact, what they did was build a kind of game: you were looking at a picture, and somewhere in the world another person was looking at the same picture.
And you earned points if you found information in the picture that your opponent did not find — basically, that was the idea; it is cheaper than an army of master's students. [Question: is it only millions of pictures of women and dogs, or also men and cars?] Men and cars too. This is what we call ImageNet. You can use it, you can browse it and see how it works — it holds all the labelled images I am talking about. Of course, this is not only learning: there is also what we call segmentation of the picture, other kinds of algorithms that prepare the data, and so on. But at the core, deep learning does the learning. So deep learning is very good for image recognition — it is basically the state of the art — and the same goes for video processing, natural language processing, speech recognition, AlphaGo... Go is a very nice game, the equivalent in Asia of what chess is for us, but infinitely more difficult. It had been one of the hardest challenges for people in artificial intelligence; since the 80s everybody said it was an impossible task — the AI task. And last year a computer beat the best Go player. That is impressive.

And there are many other successes. But we are in the presence of fat data — isn't that your question? The tendency to overfit is there; it can be controlled, but it is an issue. And there is another problem for health science in particular: deep learning acts as a black box. You have 100 million neurons; nobody can understand the underlying process that produces the prediction. That is bad for us, because we want to understand why the predictor says what it says. So, is there an alternative? Yes, there are alternatives, and I will show you some of them. Tomorrow, Professor Adilka will show you another one, and Professor Marchand yet another, so you will get some kind of overview of what can be done.

The first family I want to present is kernel methods. In this kind of approach, we look for a predictor — a classifier — that is a linear separator: in R2 it is a line, in R3 it is a plane, and in Rd, where d can be huge, it is a hyperplane of dimension d−1. Why is this interesting? Because of the following idea: you have an input space in which you would like a very complicated decision boundary, but that is algebraically very difficult. So the trick is to take your input space — the x1 to xd I gave you — and project it into a much higher-dimensional space, and in that high-dimensional space to restrict yourself to hyperplanes only. At the level of the math it is much simpler; but because you are in a much higher dimension, this hyperplane, seen from the input space, can correspond to a very complicated decision boundary. And why is a hyperplane such a convenient object? Remember your high-school algebra: a hyperplane can be defined by what is called a normal vector — the vector perpendicular to the plane — and that is about all you need.
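In symbols — my notation, but consistent with what the talk describes — a hyperplane with normal vector $w$ and offset $b$ yields the classifier

$$
f(x) \;=\; \operatorname{sign}\big(\langle w, x \rangle + b\big),
$$

and the sign of the scalar product $\langle w, x \rangle + b$ simply tells you on which side of the hyperplane the point $x$ lies.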
And why is it so convenient for the decision process to be a hyperplane? Because when you get an example in your very high-dimensional feature space, you simply have to compute a scalar product — a very easy calculation — and if the scalar product is positive you are on one side of the hyperplane, and if it is negative you are on the other side. Another important point in kernel methods: we are looking for a linear separator, but which one should we choose, since there are often many possibilities? We will try to find the linear separator that maximizes what is called the margin: the demilitarized zone between the positives and the negatives — no training point is allowed inside the margin. Look: if you take this frontier, this guy is very close; if you take that frontier, that guy is very close; the best is to put the widest possible demilitarized zone between the two classes. Maximizing the margin does not prove that you will predict well, but there are theoretical guarantees supporting the choice, especially when you have few examples.

One formula, basically, is what the support vector machine (SVM) — the most popular kernel method, I would say — is doing. Remember that $w$ is the vector that defines the hyperplane you are choosing. The SVM minimizes two things together: the norm of the vector $w$, and a term that counts the errors — more exactly, the points that are misclassified or that fall inside your margin. In modern notation, the soft-margin SVM solves

$$
\min_{w,\,b}\;\; \tfrac{1}{2}\lVert w \rVert^2 \;+\; C \sum_{i=1}^{n} \max\!\big(0,\; 1 - y_i\,(\langle w, x_i \rangle + b)\big).
$$

The second term cannot be negative: if the true label is +1 and you predict +1 comfortably, that example contributes zero; if you predict +1 while the real value is −1, you pay. And the size of $w$ is directly related to the margin — I won't explain why. So the SVM seeks as good a margin as possible without making too many errors; roughly speaking, that is it. The hyperparameter $C$ controls how much you tolerate errors. How to choose the best $C$ is an issue we will address at the end if I have time: take $C$ very high and you really refuse errors, so you will probably get a smaller margin but nearly no training errors; take a smaller $C$ and you will allow some errors in exchange for a better margin. Which is best depends on the data. A good contract for everyone.

Okay. The feature space — the big space into which we project our input space — can be really, really huge; it can even be infinite-dimensional. Most of the time, imagine the dimension going into the millions. The projection can simply be the identity function from the input space to itself — that is the linear kernel — but you can do much more complicated things; there is a whole zoo of existing kernels, and I will show you a new kernel adapted to a specific kind of data. Now, the trick — it is called the kernel trick — is that even though we work in a very high dimension, all the calculation can be done in the input space, provided we have what is called a kernel. What is a kernel? A function $k$ is a kernel if it corresponds to the scalar product in the high-dimensional space: $k(x, x') = \langle \phi(x), \phi(x') \rangle$, where $\phi$ is the projection. Most of the time such a function exists, so all the calculation happens not in a space of a million dimensions but in a space of much smaller dimension. For efficiency, this is very important. Also interesting for people like you: a kernel can be viewed as a similarity function. If $x$ and $x'$ are similar — meaning they point in basically the same direction in the feature space — the kernel takes a big value; if they are very different — at 90 degrees in the feature space — it is near zero; and on opposite sides, it is negative. So you can read a kernel as a similarity measure. This matters because we said we want to get away from black-box prediction, and a similarity function is something you can understand, most of the time.

Another very interesting consequence of the kernel trick: the prediction function we are looking for is a linear combination of similarities between the training examples and the example to predict,

$$
f(x) \;=\; \operatorname{sign}\!\Big(\sum_{i=1}^{n} \alpha_i\, y_i\, k(x_i, x)\Big).
$$

You decide that $x$ is a plus because it is more similar, in the feature space, to the training examples that are plus, and a minus if it is more similar to the training examples that are minus. And if not too many of the $\alpha_i$ are non-zero — not too many training examples involved — this can be interpretable; not the best form of interpretability, but much better than a hundred million parameters. There are techniques to push as many $\alpha_i$ as possible to zero; for example, the relevance vector machine keeps only the $x_i$ that are relevant in some way. [Question: I did not understand what $\alpha_i$ is.] $\alpha_i$ is a weight. Suppose all the $\alpha_i$ equal one: then you are democratic, every example of the training set has the same importance. You look at $x$: if it is close mostly to examples whose label $y_i$ is −1, the sum comes out negative, so you predict −1; if $x$ is closer to the examples that are +1, the sum comes out positive, and you predict +1. So basically $\alpha_i$ is the weight of example $i$ in the majority vote. Okay? It is not something I need you to understand completely — just keep the picture.

And to run all the calculations, we never need to know anything about the big feature space; what we have to construct is called a Gram matrix. A Gram matrix is simply the matrix whose entries are of the form $k(x_i, x_j)$, with $x_i$ and $x_j$ two examples of your training set. As an example of a Gram matrix: Ray Surveyor — ah, you have seen it? This is a Gram matrix; Ray Surveyor is constructing a Gram matrix. But let me continue. What is very interesting with kernel methods — I do not expect you to fully understand the next three slides, but I want you to get this point — is that we can work with people from an application domain to design new kernels. Of course, a lot of mathematics has to be verified for everything to work, but we can talk with you, understand the biology underlying your problem, and construct a kernel that incorporates that biological knowledge. Why is this so important? Again because of the fat data situation: we do not have many examples, so we must be very, very efficient with each example you give us. How can we be that efficient? If we can put biological knowledge inside the algorithm, the algorithm does not have to learn it from the data. That is the trick, and it is very important. Tomorrow you will hear the same kind of story: we can help you by designing more efficient algorithms that embed your knowledge.

[Question: can you give a simple example of the type of biological information used?] The GS kernel. The GS kernel, which we designed in our laboratory, was built as a similarity function between peptides from the point of view of their capability to bind to a protein — that was the task, basically. We have peptides, we have a binding site on a protein, and we want to say when two peptides are quite similar with respect to that binding. The formula looks complicated, but it is not so difficult if you follow the biological ideas we tried to put in — and each of those ideas corresponds to one part of the formula. First: we compare substrings. For a mathematician, a peptide is simply a string over an alphabet of 20 letters, that's all. So we look at all the subwords of peptide one and all the subwords of peptide two, and we are happy when we find the same subword at the same place — and happier still when the common subword is longer. Second: we can observe that, for example, proline and leucine have very similar properties from the binding point of view. So if one peptide has a P where the other has an L, it is not an exact match, but substituting a P for an L will probably give a similar binding behaviour; since the physico-chemical properties of the two residues are very similar, we do not penalize this mismatch much. That is the role of one of the "minus" terms inside an exponential: if we have L facing L, the distance is zero, and e to the zero gives one — the best possible score for that subword. Third: if a long subword of the first peptide appears in the second peptide but not at exactly the same position, there is a possibility that the two peptides do not bind at the same place, yet both of them still bind. So we penalize the fact that the subwords do not start at the same place on the two peptides — but we do not put zero: we use a Gaussian-like decay, and again the "minus" in the exponential means that if the positions are very far apart, the contribution vanishes.
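For the curious, the published GS kernel has roughly the following shape — this is my reconstruction from the description above, so treat the details as indicative; $\sigma_p$ controls the position penalty and $\sigma_c$ the physico-chemical penalty:

$$
k_{GS}(x, x') \;=\; \sum_{\ell=1}^{L} \;\sum_{i}\;\sum_{j}\;
\exp\!\Big(\!-\frac{(i-j)^2}{2\sigma_p^2}\Big)\,
\exp\!\Big(\!-\frac{\lVert \psi(x_{i..i+\ell}) - \psi(x'_{j..j+\ell}) \rVert^2}{2\sigma_c^2}\Big),
$$

where the sums run over all pairs of substrings of length at most $L$, and $\psi$ encodes a substring by the physico-chemical properties of its amino acids.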
Of course, we do not really know in advance how strict we should be with each of those ideas, so we introduce hyperparameters whose values will be decided when we see the problem. I do not want you to understand everything about the GS kernel — especially not the proof that it has the mathematical property called positive semi-definiteness, which is largely why there are exponentials in it. But you see the idea: we discussed with Jacques and his team, we asked them to explain exactly what matters for the binding, we arrived at these ideas together, we defined the kernel — and it works quite well. [Question about whether physico-chemical similarity is enough.] What I am trying to tell you is that defining a kernel is defining a similarity measure between peptides. Here I look only at physico-chemical properties to decide that two peptides are similar, hoping that this corresponds to similar binding values. Machine learning is "no free lunch": you cannot have one algorithm that does everything for everyone; you have to take care. But the hyperparameters give me dials to turn. Suppose the position-decay idea is a very bad one for a specific binding-affinity task: then, if I look at the data correctly when I learn, I will find that I have to set that σ to a very high value, so the decay plays no role. If the decay is a good idea, I will end up with a small σ. So it will not work on every problem, but it works quite well — well enough that in 2012 we won the Dana-Farber competition on peptide binding. [Question: does that mean that for every problem you have to optimize the kernel?] Yes, for every problem. You saw that in the SVM there is the hyperparameter C that sets the trade-off between margin and errors; now I also have the hyperparameters σ_p and σ_c. This is something I deal with when I see the data — I have no choice. I need an algorithm with enough flexibility, and when I see the data, I will show you how I try to decide the value of C, of σ_p, of σ_c, and so on. Good — I have not lost too many people? You understand the idea, I think. [Comment: you should mention what it takes to get there.] Indeed: designing this kernel took about a year and a half, to be able to prove everything, and there were many other issues I am not telling you about, in particular the efficiency of the computation — you saw the formula, it is quite heavy.

And as I told you, Ray Surveyor builds Gram matrices: Ray Surveyor is basically computing a similarity function between two bacteria, and the picture you see is the Gram matrix — the machine stores numbers, and the colours are just a translation of those numbers. [Question: what biology is fed into Ray Surveyor, for example?] The GS kernel is rather sophisticated, but in Ray Surveyor it is simple: you have a "hit" when the same k-mer appears in the two genomes, and you count the number of hits between your two bacteria — a high count means high similarity. So kernels can be designed in very simple ways or in very complicated ways. The idea I want you to keep is this: it is possible to design the kernel so as to increase the ability of the algorithm to find, quickly and from few examples, the predictor you are looking for.

Another alternative is called ensemble methods. I have only one slide about them, because Mario Marchand will talk about this tomorrow. What happens if you have learned many classifiers that are individually not very good, but you have a lot of them — perhaps because each classifier was trained on one part of the data (when the data is very, very big), or with a very fast but weak algorithm? Can you make them work together? That is the idea of ensemble methods. For example: suppose the only predictors I can build are horizontal lines, vertical lines, or no line at all; can I do something more interesting by combining them? Yes — with a majority vote, and I am allowed a non-democratic vote if that is better. You have probably heard of random forests: that is the best-known ensemble method, and in my lab some of the algorithms we have designed are ensemble methods too.

The most interesting for fat data, I think, is an algorithm that produces very sparse classifiers. It is called the Set Covering Machine, developed by Mario Marchand and John Shawe-Taylor in the early 2000s. The idea of the algorithm is to look for a rule-based classifier — a predictor built from rules — that uses the fewest possible rules: as simple as possible, under the constraint of not making too many errors on the training set, otherwise we are underfitting. And there is a body of theory, called sample compression, that protects against the fat data problem. This morning Maria told you that the p-value had to be tuned down to 10 to the minus 11 to protect yourself; with a sample compression approach we can do much better than that. I will not go into it — I decided five minutes before starting not to show you the theorem. Of course, such a predictor does not always exist; if there is no small set of good rules, go back to the drawing board. But if such a classifier exists, we have good guarantees that it will be good on the examples to come — the Occam's razor idea again. So it is important to say it clearly: with the Set Covering Machine you are not sure to find something interesting, but if you do, you are really in business, because it is interpretable — a small number of rules you can actually read — it comes with guarantees, and it is very, very fast.

So may I present Kover, the Set Covering Machine adapted to genomic data. The idea: since we are dealing with genomes, we consider strings of size k, and we look at every k-mer that appears in at least one example of the training set — that gives us a kind of dictionary. Then, for an example x, you go through the dictionary in order: if the first k-mer of the dictionary — say, GATAGATA — is present in x, you put a 1 at the coordinate corresponding to that k-mer; the second k-mer of the dictionary is absent, so you put a 0 there; and so on. You construct exactly what I told you at the beginning we always construct: a vector.
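A toy version of that featurization — my own sketch; Kover's real implementation is far more clever about memory, and the Gram-matrix line at the end is just to echo the Ray Surveyor idea:

```python
# Toy k-mer presence/absence features (illustrative only).
import numpy as np

def kmers(seq, k):
    """All k-mers (substrings of length k) present in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

genomes = ["GATAGATACCA", "TAGATACCAGG", "CCAGATAGATA"]   # made-up "genomes"
k = 4

# The dictionary: every k-mer seen in at least one training example.
dictionary = sorted(set().union(*(kmers(g, k) for g in genomes)))

# One row per genome, one 0/1 coordinate per k-mer of the dictionary.
X = np.array([[int(km in kmers(g, k)) for km in dictionary] for g in genomes])
print(len(dictionary), "k-mers ->", X.shape, "matrix")

# Ray-Surveyor-flavoured Gram matrix: entry (i, j) counts the shared k-mers.
print(X @ X.T)
```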
For bacteria, this can be a vector of dimension a hundred million — this is real fat data. But the idea is there. And what will the Set Covering Machine do with it? It will consider all possible rules of presence or absence of a k-mer: a hundred million possible k-mers, hence two hundred million possible rules. A lot of rules — enough to hang ourselves with. What the Set Covering Machine algorithm does is look at all the rules, look at the data, and try to find the smallest conjunction of rules that does not make too many errors. I won't explain here exactly how it does it, but the idea is there. And we have an out-of-core implementation, meaning we can run this on a laptop, very fast. Feature selection is not really required: we do not restrict ourselves, for the human genome say, to known SNPs — we take the whole genome and all possible k-mers. We do not want to declare in advance "I am sure this part must be important"; we let the data decide for us. And we have theory covering this, but I will say nothing more about it.

Here are some experiments we did. We took some families of bacteria and some families of antibiotics, and we tried to predict, given the genome of a bacterium, whether it is going to be resistant or not. The assembly of the genomes was done with a genome assembler. We had a problem at the beginning: the Set Covering Machine does not care about having 100 million features, but the other algorithms are not so happy with that. So for the other algorithms, we ran a chi-square test on all 100 million possible k-mers and kept only the best 1 million features; the Set Covering Machine alone received everything. And what happened? We are accurate. Compared with decision trees, the SVM with the kernel I told you about, and another SVM — the machine learning competition — we are quite accurate most of the time, very good even; and "very good" means about 3% error with a few hundred examples, which is quite good. But above all we are fast and scalable: the others take a long time and needed complicated cluster setups with Spark, and we repeated each experiment 10 times. And the Set Covering Machine needs fewer than 4 rules, so it is really interpretable: "there will be resistance if this k-mer is present, and not that one, and this one" — something you can hand to an expert in antibiotic resistance. And we did that.

[Question: we talked earlier about the Naive Bayes classifier on 16S data — could this kind of problem also suit Naive Bayes, since it is really good?] I did not try Naive Bayes, but my experiments say that if a linear SVM is not better than the SCM, I would be very, very surprised if Naive Bayes were. We could try it for fun — it is something we could look at — but note that your Naive Bayes will not be interpretable: its rule involves 100 million features. I am not saying we are the kings of antibiotic resistance, but we are interpretable and quite good at the same time. This is based on the PATRIC dataset, and the people from PATRIC have not really been able to beat us — though I agree we did not try every algorithm in the book. And sometimes the model is a single k-mer. [Question: do the k-mers come with any sort of weights?] Not for now — that is future work, because you would like an importance score for the contribution of each k-mer to the model. These results do not include that, but we have some ideas; we would also like to look at how the rules spread across a bacterial family, and so on — new ideas, ongoing variations. [Question about the size of the datasets.] The datasets have about 200 to 300 examples each, the biggest a couple of thousand — so, between one hundred and two thousand. It is fat data flirting with big data: not 50 examples, but not so big from the machine learning point of view.

We also tried the following: what happens if we give the Set Covering Machine only the one million best features according to the chi-square p-value? What we observed is that most of the time it degrades: feature selection gives the algorithm fewer possibilities — supposedly the best ones — and we are less good. Why? Basically because the predictor is a conjunction of features: it is possible that, once the first feature is chosen, a second feature with a very unimpressive p-value on its own becomes very important in combination with the first. We are not looking at features one at a time; we are looking at them jointly. And this was a surprise; most of the time the result is even sparser than the small number of rules we allow. Also, we looked at where the k-mers we found are located: sometimes they fall on a gene such that an expert tells you, "well, you have re-discovered this known result — and this one too." (By the way, Alex, I think this is the one you will re-compute in the hands-on session.) So we make a lot of re-discoveries for a few hours of calculation. And of course there are some k-mers that we cannot connect to anything in the literature. What can have happened there? First, it can be a false discovery: we believe the k-mer tags the gene responsible, and it simply is not true — that is called a false discovery. The second possibility is that we found a k-mer that is merely correlated with another k-mer that is the true cause — since we do statistics, we cannot infer more than correlation. The third situation is the really interesting one: a genuine discovery — but now you have to convince people to give you money to verify it. Still, I think this is the right way to work, and it is important, because you see the loop between the machine learner and the scientist.

We also did the following: when we find the best k-mer, we remove it from the feature space, we re-run the algorithm, and we watch what happens, repeating until the predictive ability degrades. What we usually find is that a whole gene is really responsible for the resistance: the replacement k-mers keep coming from the same gene, and after 3, 4 or 5 removals the degradation happens — often we can delineate the responsible gene this way. (A toy version of this remove-and-retrain loop is sketched below.)
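Here is that loop in miniature — my own sketch, where an L1-penalized linear model stands in for Kover (an assumption for illustration, not the speaker's tool):

```python
# Toy remove-and-retrain loop (a linear model stands in for Kover).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 500)).astype(float)   # fake k-mer matrix
y = X[:, 42].astype(int)                                # resistance driven by k-mer 42

active = np.arange(X.shape[1])          # k-mers still in the feature space
for round_ in range(5):
    clf = LogisticRegression(penalty="l1", solver="liblinear")
    score = cross_val_score(clf, X[:, active], y, cv=5).mean()
    clf.fit(X[:, active], y)
    best = active[np.argmax(np.abs(clf.coef_[0]))]      # most-used k-mer
    print(f"round {round_}: CV accuracy {score:.2f}, removing k-mer {best}")
    active = active[active != best]                     # remove it and re-run

# Accuracy typically collapses once the truly causal feature is gone.
```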
We also have a platform on which you can browse our results: you can see the predictor that was learned, where the k-mers fall in which genes, and so on — and if you want to have fun, we give you the possibility to run the predictions yourself. I am done with that part. How much time do I have? Ten minutes. So: the proper use of machine learning. This will go fast, because Alex will work through it with you afterwards. The basic idea is this: you have a large collection of possible learning algorithms, and for each learning algorithm a lot of possible hyperparameters to tune. So you have many possibilities, and if you want to do things properly, here is what you do. You divide your data into a training part and a testing part, and you put the test part away: it is only for the very end of the process. Then, inside your training set, you divide the data into some number of folds — 4, 5, 10. For each algorithm, and for each set of hyperparameters you would like to try, you run it 5 times: the first time you train on four of the folds and measure how good you are on the remaining fold — that gives you the test accuracy on that fold — and you repeat with each possible fold held out. The cross-validation score is simply the average of those 5 numbers (10 numbers if you use 10 folds). You do this for every algorithm, and you take the one with the best cross-validation score. Is it truly the best one? It is your best shot — you can still overfit with this procedure — but at least, at every step, your algorithm never saw the data on which it was evaluated. Then, when you have found the best algorithm and the best set of hyperparameters, you re-run that algorithm, with those hyperparameters, on the whole training set, and that gives your final predictor. And only at the very end do you evaluate this predictor — and only this one — on the test set. Otherwise you are back to dealing with p-values of 10 to the minus eleven. (A sketch of this routine follows.)
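In scikit-learn the protocol looks like this — a sketch under the assumption that toy data stands in for yours:

```python
# Proper protocol: hold out a test set, cross-validate to choose the
# hyperparameters, refit on the full training set, then test exactly once.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# The test set is put away until the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 5-fold cross-validation over an SVM's hyperparameters (C and kernel).
grid = GridSearchCV(SVC(),
                    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                    cv=5)
grid.fit(X_train, y_train)        # also refits the best model on all of X_train
print("best hyperparameters:", grid.best_params_)
print("test accuracy (used once):", grid.score(X_test, y_test))
```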
Even with this careful method, things can go wrong. For example, 90% of machine learning algorithms assume that the data was gathered i.i.d. What does i.i.d. mean? Independent and identically distributed: there is some probability distribution — call it mother nature — that hands you samples, and the fact that you saw one patient does not change the probability of seeing another; everything is drawn independently from the same distribution. In reality this is never exactly true: in a study on cancer, for example, the people in the dataset will live close to the research centres participating in the study. Usually we are not too far from i.i.d. — but the examples on which you will have to predict must come from the same distribution as the training data. And what can go wrong? Say you have pictures and you want to differentiate cats from dogs; what a "good" learning algorithm may actually do is differentiate pictures with balls from pictures without balls, because the dogs happen to be photographed with balls. You laugh, but it happens all the time. It happened to the U.S. Army: they wanted a machine learning approach to determine whether there were tanks in the woods, so they built a training set — but all the pictures of forest with tanks were taken during the day, and all the pictures without tanks at another time. So what did the algorithm do? It looked at the colour of the sky, and the accuracy was "perfect". And we made the same mistake once: in a mass spectrometry analysis, all the controls were run on one day and all the cases on a second day, and what the algorithm detected was the calibration of the machine. This is the classic trap: correlation is not causality.

Another thing that can go wrong: we do statistical learning, so rare events are a weak point. You know the accident where a Tesla ran into a truck: the Tesla took the white side of the truck for bright sky, and the driver died — he was watching a video, so sure was he that the Tesla could drive by itself. Events that almost never occur in the data are something machine learning cannot really deal with.

I will finish with the adversarial case, which is very interesting. Suppose you have trained a deep network to recognize what is in a picture, and a hacker looks at your predictor and says: I will take this picture and perturb it just a little bit. Do you see the difference between this image and that one? Not really. But the neural network says this one is a panda and that one is a monkey. And you think: well, that can happen from time to time... But look at this one: for the neural network, after a small modification — what is this? a school bus — this is an ostrich. This is an ostrich! Of course, nobody here is under adversarial attack — not too much, I hope — but it is something to take into account. And I think you can even put the idea to work in your favour.

Let me finish with that. Suppose you have data that comes from two different labs, and because you are in a fat data situation, you cannot just say "I will keep only one of the two labs". You have to live with the fact that between the two labs there are slight differences of protocol, slight differences in the calibration of the machines, and so on. What can you do? This is somewhat future work for us — not something we already handle, but an idea we want to explore — and it is based on neural networks. Remember what I told you: a neural network tries to find a representation for which your prediction task becomes easy. Now, what if I add an adversary? I ask my neural network to find a representation that is good for my prediction task, but from which it is very difficult to identify which lab the data came from. I want a representation that is good for the prediction and bad for distinguishing the labs. What does that force the network to do? To discard the information that is specific to one lab or the other, and to keep only what is common to the two labs. If the representation is still good for prediction, then we have alleviated the problem, even though the two labs' data are quite different. (A sketch of this adversarial setup follows.)
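A condensed sketch of that idea — a gradient reversal layer in PyTorch, in the spirit of domain-adversarial training; the architecture, sizes and data here are invented for illustration:

```python
# Domain-adversarial sketch: learn features that predict the label well while
# making the "which lab?" classifier fail, via gradient reversal. Illustrative.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output       # reversed gradient: confuse the domain head

features = nn.Sequential(nn.Linear(2, 16), nn.ReLU())   # shared representation
label_head = nn.Linear(16, 2)                           # the task: +1 / -1
domain_head = nn.Linear(16, 2)                          # lab A vs. lab B

params = [*features.parameters(), *label_head.parameters(),
          *domain_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Made-up batch: inputs, task labels, lab-of-origin labels.
x = torch.randn(64, 2)
y_task = torch.randint(0, 2, (64,))
y_lab = torch.randint(0, 2, (64,))

for step in range(100):
    h = features(x)
    loss = loss_fn(label_head(h), y_task) \
         + loss_fn(domain_head(GradReverse.apply(h)), y_lab)
    opt.zero_grad()
    loss.backward()
    opt.step()
# The reversed gradient pushes the shared features to hide lab information,
# while the task loss keeps them useful for the actual prediction.
```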
Look at the toy dataset called the two moons. If you train a neural network on the two-moons data, you see it is very good at the task. Now add the data of the "second lab" and ask the training to do both things — predict correctly and be adversarial — and you see that it becomes much more difficult to tell the two labs apart. But what happens if we use this learned representation — ah, no electricity... there, it is better — what happens if, with this representation, we try to differentiate the upper part, which is +1, from the lower part? The algorithm has some difficulty, because the representation was learned for the other setting, but it is not so bad — not perfect, but not totally bad. And what happens with the lab-identification task? With this representation, the algorithm cannot do anything — which is exactly what we wanted. Of course, this is a toy dataset, and it is not necessarily the same with real data, but that is exactly the idea. I think it is the kind of solution we can use when we have to deal with data coming from different places, with slightly different protocols; of course, if the protocols are too different, there is nothing to be done.

I think this is interesting. Take-home messages. First: you should now know how to recognize a situation where machine learning is suitable for you — basically, you need labelled examples. Second: be aware of the dangers: if I can fall into the pit, you can fall even more easily than me. And third: do not forget that you can work with machine learning people to put your knowledge into an algorithm and enhance its performance — and I will be happy the day we help make machine learning fully adapted to microbiome analysis; I am waiting for that. And one last thing: thank you.