So, good afternoon everybody, and thanks to the organizers. I will give a talk orthogonal to Manfred's, because it will not be so much about theory as about an application of machine learning to real data. These are data from proteins, that is, protein sequences, and the aim is to design new sequences using the Boltzmann machine and the restricted Boltzmann machine. These are basically two works done in collaboration with Rémi Monasson, Jérôme Tubiana, Martin Weigt, Matteo Figliuzzi, and two experimentalists, Rama Ranganathan and Bill Russ.

The idea is that proteins are chains of amino acids, and it is very difficult to understand from this chain how the sequence will fold and function. For example, this is the case for a small protein domain called WW, which has to bind a small peptide. People have tried, for example with protein folding simulations, to guess the structure of a protein from its sequence; this has been attempted over the last 50 years, and it has been shown to be very difficult. I would also like to underline that a protein has to do a lot of things: it has to fold stably into its native structure and not into other folds; it must, as in this case, bind a peptide in a specific way; and once it has bound the peptide, it promotes a reaction with the target ligand, and sometimes it changes conformation.

So the idea is that we would like to read all these constraints from sequence data. And we now have sequence data, because there are many sequences of this protein, the WW domain, collected in protein databases such as Pfam. These sequences all work well for this protein, and they are sampled from different organisms, so they are real, natural data. We can represent them as points on a landscape in which they are all functional points, because they are sampled from living organisms. What we want to do from these data points is to infer a function which goes through them: the probability that a sequence is a good sequence for this protein. This gives us a generative model, which can be used, for example, to predict the cost of a mutation from a given sequence. Imagine you are here, at the sequence of the rat, which is called the wild type; you mutate some amino acids, and you want to know whether the mutation will be deleterious or not. Once you have inferred this probability function from the data, you can make that prediction. What we will also see is how to use this model to design new sequences which have not been sampled through evolution.

Here our variables are the amino acids along the sequence. The sequence has N sites, and each site carries a categorical variable taking 20 possible values, the amino acids, plus one extra symbol called the gap, so 21 possible values per site. This is called a Potts model, and we want to infer it from the data.
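For reference, this is the standard form of the Potts distribution just described (notation mine: a_i in {1, ..., 21} is the symbol at site i, with the local fields h_i and couplings J_ij that are introduced below):

```latex
P(a_1,\dots,a_N) \;=\; \frac{1}{Z}\,
\exp\!\Big( \sum_{i=1}^{N} h_i(a_i) \;+\; \sum_{i<j} J_{ij}(a_i,a_j) \Big),
\qquad
Z \;=\; \sum_{a'_1,\dots,a'_N} \exp\!\Big( \sum_i h_i(a'_i) + \sum_{i<j} J_{ij}(a'_i,a'_j) \Big)
```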
So I will show you how to infer this generative model using two kinds of machines. The first is the Boltzmann machine, in which each unit of my system, each amino acid in my protein, is connected to the others through a coupling matrix. These couplings, one for each possible pair of sites, are the key parameters we want to infer from the data. Then we will also use the restricted Boltzmann machine to learn a model from the data. It has a different architecture: our variables, the amino acids of the sequence, are connected to a hidden layer, and we will infer the couplings, the weight parameters between the visible layer and the hidden layer. We will then use these machines to generate new functional sequences, and I will show you the experimental tests of these sequences.

So let's start with the Boltzmann machine, used to design two proteins: WW, the one we have already seen, and an enzyme called chorismate mutase.

How do you infer a Boltzmann machine? You start from the data, this alignment, and you want a model which reproduces the one-point and two-point statistics of your data. These are the frequencies of each Potts variable at each position, also called conservation, and the correlations between Potts variables at pairs of positions. In some sense, you are assuming that these frequencies and correlations carry what is important: the structural and functional constraints on your protein. For example, at a site like this one you will always see the same amino acid, because it is, say, a binding site, so it needs a given type of amino acid to specifically bind some ligand. At another pair of sites, like this one, you will observe correlations: you can have one configuration or the other, and these amino acids are charged. Such pairs correspond to sites which are nearby in the three-dimensional structure of the protein, and they must stay oppositely charged to preserve the structure.

So what we do is infer a model with two kinds of parameters: local fields on the amino acid variables at each position, to reflect the amino acid frequencies, and couplings, to reflect the correlations observed in the multiple-sequence alignment. This is also called the maximum-entropy model reproducing the frequencies and correlations of the empirical distribution of the data. You learn the parameters of this model, the fields and the couplings, by maximizing the log-probability of the data with respect to the parameters. This corresponds to finding the couplings and fields for which the averages of the variables under the model distribution equal the frequencies and correlations you have measured empirically.

One important point in this optimization is the regularization. It is important to add regularization terms which ensure that you are not overfitting your data; in particular, they prevent the parameters from running off to minus infinity when, for example, an amino acid is never observed at some site. So you fix a prior on the distribution of fields and couplings. Finding the optimal regularization is normally difficult and depends on the goal of the inferred model.
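As a minimal sketch of this moment-matching update (names are mine; in practice the model statistics are estimated by Monte Carlo sampling, discussed next, and the L2 term stands in for the Gaussian prior on the couplings):

```python
import numpy as np

def bm_gradient_step(h, J, f1_data, f2_data, f1_model, f2_model,
                     lr=0.05, l2=0.01):
    """One gradient-ascent step on the regularized log-likelihood of a Potts model.

    h        : (N, q)       local fields
    J        : (N, N, q, q) couplings
    f1_data  : (N, q)       single-site frequencies measured in the alignment
    f2_data  : (N, N, q, q) pairwise frequencies measured in the alignment
    f1_model, f2_model      : same statistics under the current model (e.g. MCMC)
    """
    # Log-likelihood gradient = data statistics - model statistics;
    # at the maximum, the model reproduces the empirical frequencies
    # and correlations exactly.
    h = h + lr * (f1_data - f1_model)
    J = J + lr * (f2_data - f2_model) - lr * l2 * J  # penalty keeps J finite
    return h, J
```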
So what do we then use this inferred model for? The first application, which has been very important over the last 10 years, and to which Martin Weigt has contributed, is to take the inferred coupling matrix and identify the most strongly coupled sites. One can show that these sites are indeed nearby in the three-dimensional structure of the molecule, so the couplings really give you structural information about the 3D fold of the protein. The model has also been used to predict the cost of mutations, and we will see the application to designing new sequences.

In practice, we can sample from this probability distribution with a Monte Carlo simulation and find sequences with small energy, that is, high probability under the model. These sequences are candidates for being good sequences for this protein family.

This is, in some sense, related to what Rama Ranganathan has tried to do since 2005. The idea is to build new sequences reflecting the amino acid statistics of the multiple-sequence alignment. He tried this in 2005 on an alignment of WW; at the time there were 120 sequences, whereas now you can find 10,000 sequences for this protein. What he did was scramble the alignment so as to keep the frequencies of the amino acids at each position. He actually made two sets: one in which he kept only the one-point frequencies, and one in which he managed to also keep the two-point correlations. He then tested 40 sequences built keeping only the conservation, that is, the frequencies; 40 sequences from the set in which he also kept the correlations; 19 random sequences; and 42 natural sequences from the alignment. He synthesized all these sequences and studied experimentally whether they fold or not. What he found was that the sequences built to reflect only conservation did not fold, like the random ones, while of those built to also reflect the correlations of the multiple-sequence alignment, 30% folded, and of course about 70% of the natural sequences folded.

What we did was build a Potts model on this alignment, first using its energy to score the sequences Rama Ranganathan had synthesized, and then sampling new sequences by Monte Carlo simulation. Of course, if you want to build new functional sequences, you have to be sure that you reproduce the statistics of your alignment well: the one- and two-point statistics, but also higher-order statistics, for example three-point correlations. For this you really need accurate inference methods, such as Boltzmann machine learning and the adaptive cluster expansion we developed some years ago, and low regularization is also really important.

So here are the results. I plot the energies of the different sequences Rama Ranganathan had tested: the sequences built to reflect the first- and second-order statistics, the sequences built from the alignment reflecting only conservation, and the natural sequences; the red bars are the sequences which fold. All are scored by the energy of the inferred Potts model.
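A minimal Metropolis sketch of the scoring and Monte Carlo design just described (hypothetical names; the energy is minus the exponent of the Potts model above, and the temperature T is the knob used in the next experiment):

```python
import numpy as np

def potts_energy(seq, h, J):
    """E(a) = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j); low energy = high probability."""
    N = len(seq)
    e = -sum(h[i, seq[i]] for i in range(N))
    e -= sum(J[i, j, seq[i], seq[j]] for i in range(N) for j in range(i + 1, N))
    return e

def design_sequence(h, J, q=21, T=1.0, n_steps=50_000, seed=0):
    """Metropolis sampling of P_T(a) ~ exp(-E(a)/T) by single-site mutations.

    (A production implementation would recompute only the local energy change
    of each mutation; the full energy is recomputed here for clarity.)
    """
    rng = np.random.default_rng(seed)
    N = h.shape[0]
    seq = rng.integers(q, size=N)
    e = potts_energy(seq, h, J)
    for _ in range(n_steps):
        i, a = rng.integers(N), rng.integers(q)
        prop = seq.copy()
        prop[i] = a
        e_new = potts_energy(prop, h, J)
        if rng.random() < np.exp(min(0.0, -(e_new - e) / T)):  # Metropolis rule
            seq, e = prop, e_new
    return seq, e
```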
And you can see that the energy indeed seems to be a good indicator of the ability of these sequences to fold, because the ones which do fold are the ones with the smallest energy. So we also designed new sequences by Monte Carlo on the inferred model and proposed them to Rama Ranganathan to test. But at the time he was already working on another protein, an enzyme called chorismate mutase, so he told us: give me some sequences for this enzyme, not for WW. And this is what we did.

From the experimental point of view this was a much more challenging experiment, because in this case Rama really wanted to test the function of the protein. This enzyme is very important for the synthesis of two amino acids, tyrosine and phenylalanine, and you can measure in vitro the reaction rate of the enzyme: how fast the two amino acids are produced in its presence. What Rama did was take a library of sequences corresponding to this protein; he took a bacterium, E. coli, knocked out the gene coding for this protein, and put on a plasmid the synthetic gene we had proposed by machine learning. He then measured the growth rate of the bacterium with the mutated gene. This growth rate, relative to the growth rate of the bacterium with its own gene, is the measure of fitness, and you can see that this measure of fitness is indeed related to the biochemical reaction rate measured in vitro. So this is a very essential enzyme for the growth of the bacterium.

Here are the results with our designed sequences; this is work in collaboration with Martin Weigt and Matteo Figliuzzi. Again, the plot is very similar to the previous one: I plot the energies, and the bars correspond to the different sequences Rama tested. Now he tested 1,000 natural sequences and 2,000 artificial sequences. Here is the distribution of energies for the natural sequences, and these are all the working natural sequences; they are natural in the sense that they are taken from the multiple-sequence alignment, but they are not the sequence of E. coli itself, which has its own gene. Then he tested our sequences, which we generated at different temperatures, as you can see here, and the ones which worked were essentially the ones generated at low temperature. At temperature 1/3, 53% of the sequences were functional; at temperature 2/3, 26%; and at temperature 1, just 5%.
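To make the temperature protocol explicit: sampling at temperature T means raising the inferred Potts distribution to the power 1/T, so low T concentrates the samples on low-energy sequences (notation as in the sketch above):

```latex
P_T(a_1,\dots,a_N) \;=\; \frac{1}{Z_T}\; e^{-E(a_1,\dots,a_N)/T},
\qquad T \in \{1/3,\; 2/3,\; 1\}
```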
The next question is about the structure of this space of sequences: do we reproduce it well with our synthetic sequences, or are we just sampling sequences very close to the original E. coli sequence, which would trivially work?

So what I plot here is again the inferred Potts energy as a function of the distance, or rather its opposite, the percentage of identity to the closest natural sequence. Again you can see that the sequences which work, here in red, compared with those which do not, in green, are the ones at lower Potts energy, and they can be quite dissimilar from the natural sequences: up to 40% of their amino acids differ from the closest natural sequence. So these really are newly designed sequences. If you look instead at the Potts energy versus the distance to E. coli, you see a cluster of sequences very close to the original wild type, which is E. coli, but also others with just 20% of their amino acids in common with the natural E. coli sequence. So we generate very diverse sequences.

But then we wanted to understand better why 50% of them were not working, and this is related to the structure of the space of sequences, which is not random: the natural sequences occupy a well-structured space. You can see this by projection: here all the tested natural sequences are projected onto the two principal components of the correlation matrix, and you see clusters corresponding to sequences which are nearby in this space because, for example, they have similar functionality or similar specificity, or they are simply phylogenetically related. What you can then do is cut this space with a linear separator: the working sequences, the red ones, lie more or less below the line. Taking the same separator for the artificial sequences we designed, 80% of those on the working side of the line are functional.

So the key message here is that you really have to take into account the fact that the sequences have a particular structure, organized in subfamilies, and if you are far from the wild-type sequence in this space, it is more probable that your sequence will not work.
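A minimal sketch of this projection-and-separator analysis (hypothetical names; sequences are assumed one-hot encoded into a matrix of shape (n_sequences, N*q), with boolean labels from the functional assay):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def fit_pca_separator(X_natural, is_functional):
    """Project tested natural sequences on the top two principal components
    of the alignment and fit a linear separator in that plane."""
    pca = PCA(n_components=2).fit(X_natural)
    z = pca.transform(X_natural)                      # 2D map; clusters = subfamilies
    clf = LogisticRegression().fit(z, is_functional)  # the 'line' in the talk
    return pca, clf

# Hypothetical usage: keep designed sequences that fall on the working
# side of the same separator.
# pca, clf = fit_pca_separator(X_natural, labels)
# keep = clf.predict(pca.transform(X_designed)).astype(bool)
```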
In the rest of the talk I would like to convince you that you can use another machine, the restricted Boltzmann machine, which is able to also take this structure of the sequence space into account. So we will do the same thing: infer a generative model with this machine, starting with the WW alignment.

Why do we think this machine is really powerful for learning a representation of your data? Because it has two layers. You start again from your high-dimensional data space, where each sequence is a point, and with the hidden layer you introduce another space, a representation space. The idea is that this is a low-dimensional space in which you can cluster your data, for example by the different functionalities of your protein, its specificity and activity, so that you find the different subfamilies. This is exactly what the hidden units do: they extract features from a sequence. So we will build the probability of activating a hidden variable, say this one, given an input sequence; in this way you map the space of sequences into the space of hidden variables. And you can do the converse: fix a hidden-layer configuration, for example this one, and generate new sequences from the probability of sequences given the fixed hidden layer. This is what we are going to do: use this structure to generate sequences in a given portion of the sequence space, in a given subfamily. We will see the application to the WW domain.

So here are the parameters of the restricted Boltzmann machine. What you have is a bipartite graph with two sets of random variables: the visible units, which in my case are the sequences, and the hidden units. You write a distribution over these two sets as a Gibbs measure whose energy contains three kinds of terms: fields on the visible units, a potential on the hidden units, and couplings between the two layers through the weight parameters. Once you have defined this energy function, you can do exactly what I showed you before. For example, take a configuration of visible variables as input and compute the probability of activating the different hidden units: you first project your sequence onto the weight vectors, which gives an input to each hidden unit, and then the probability that a hidden unit takes a value h_mu, given the visible configuration, is just the exponential of minus the potential on that hidden unit plus the input term multiplied by the value of the hidden unit. In this way you achieve exactly the mapping I showed before, from a sequence to a configuration of hidden activations. And you can do the opposite: fixing the hidden units, the probability of generating a sequence is the exponential of the local fields plus the coupling terms between the two layers. So these are very powerful machines for going back and forth between the real space of configurations and the space of features. Finally, you can integrate over all possible configurations of the hidden units to obtain the marginal likelihood of a visible configuration, and you use this likelihood to train the RBM by unsupervised learning, so that your data set is the most likely; this is how the parameters of the network are fixed.
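Putting the verbal description above into formulas (notation mine: v is the sequence, h_mu the hidden units, w_imu(v_i) the weights, g_i the visible fields, U_mu the hidden-unit potentials):

```latex
E(\mathbf{v},\mathbf{h}) \;=\; -\sum_i g_i(v_i) \;+\; \sum_\mu \mathcal{U}_\mu(h_\mu)
\;-\; \sum_\mu h_\mu\, I_\mu(\mathbf{v}),
\qquad I_\mu(\mathbf{v}) \;=\; \sum_i w_{i\mu}(v_i)

P(h_\mu = h \mid \mathbf{v}) \;\propto\; e^{\,-\mathcal{U}_\mu(h)\,+\,h\,I_\mu(\mathbf{v})}
\quad\text{(sequence $\to$ representation)}

P(v_i = a \mid \mathbf{h}) \;\propto\; e^{\,g_i(a)\,+\,\sum_\mu h_\mu\, w_{i\mu}(a)}
\quad\text{(representation $\to$ sequence)}

P(\mathbf{v}) \;=\; \frac{1}{Z}\int \Big(\prod_\mu dh_\mu\Big)\, e^{-E(\mathbf{v},\mathbf{h})}
\quad\text{(marginal likelihood used for training)}
```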
To come back to the parameters of the network, there are two kinds: the hyperparameters, fixed before training, and the parameters fixed during training. The hyperparameters are the number of hidden units, the shape of the hidden-unit potential, and again the regularization, which is important. There is a nice work by Jérôme and Rémi in which they showed that, depending on the hyperparameters, you can put the machine in a good phase for learning and sampling, in which learning is easy and so is sampling. This is what they call the compositional phase: each sequence activates a finite number of hidden units and, in reverse, each combination of a finite number of activated hidden units generates good sequences. So we trained the machine in this compositional phase, using double-ReLU potentials, an extension of the ReLU potential, sparse weights, and a number of hidden units typically of the same order as the number of visible units.

Let me show you how the hyperparameters are fixed in the end, by cross-validation. The idea is that you divide your data into a training set and a test set, and then you vary, for example, the penalty strength; here we put an L1 norm on the weight parameters. Here is what you obtain as weight parameters when you train on the WW data. We represent the weights in a way that lets you really see the motifs which are important in the sequence: the representation is similar to a sequence logo. Each letter represents an amino acid, one of the 21 possible Potts states, and the height of the letter is proportional to the weight for that amino acid, which can be positive or negative. You can see from the heights that when you train the machine with a good regularization, you obtain motifs which are important for this protein. If instead you set the regularization to zero, this is what you get, and you can see that it is a really bad idea not to fix any regularization, because the log-likelihood of the test set is much lower than that of the training set: you are just overfitting your data. You could also choose the value right at the maximum of the test-set log-likelihood, but then you would have less interpretable weights. So we trained our machine in this regime of regularization.
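A minimal sketch of this hold-out selection of the L1 penalty (train_rbm and log_likelihood are hypothetical stand-ins for the actual training and scoring routines):

```python
import numpy as np

def scan_l1_strength(train_rbm, log_likelihood, data, lambdas, seed=0):
    """Train at each L1 penalty and compare train/test log-likelihood.

    train_rbm(train_set, l1)     -> trained model      (hypothetical)
    log_likelihood(model, data)  -> mean log-likelihood (hypothetical)
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    split = int(0.8 * len(data))
    train, test = data[idx[:split]], data[idx[split:]]
    results = []
    for lam in lambdas:
        model = train_rbm(train, l1=lam)
        # Overfitting shows up as a large train/test gap (the lambda = 0 case
        # in the talk); a penalty somewhat above the test-likelihood maximum
        # keeps the weights sparse and interpretable.
        results.append((lam, log_likelihood(model, train),
                        log_likelihood(model, test)))
    return results
```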
Now here are the results. We wanted to apply this to the WW family because it is known that WW has four different binding specificities for its ligand; in practice the multiple-sequence alignment separates into different subfamilies of sequences corresponding to the different specificities, and four specificity groups have been defined. This is the ligand: there are always proline amino acids, but in the middle there are different kinds of amino acids. Rama Ranganathan, in his work on this protein, showed by PCA analysis that there are basically eight positions which are very important for the binding specificity.

I would like to show you two weights, two among the 30 we found after training the RBM on this sequence data. These are very important weights, because you can see that they are localized on two groups of amino acids: the ones here, in green, which are the sites known to be important for the binding specificity, and this other group of amino acids on this loop here. The idea is that in the motifs you see on these weights you can recognize patterns which are important for a sequence to be, for example, of type 1: all the amino acids you see at the bottom of these weights correspond to amino acids present in the type-1 specificity sequences.

In practice, these two weights are able to partition the space of natural sequences into three clusters. Here the dots are the sequences of the multiple-sequence alignment, the colors correspond to tested sequences, and what is plotted is the value of the inputs these sequences give to hidden units 3 and 4. You can see that, thanks to these inputs, the space of sequences splits into three principal clusters: type 1, types 2 and 3 together, and type 4.

Once we saw that the sequences can indeed be clustered in this way, what we did was use the machine to generate new sequences. Again you can generate sequences in the three clusters, but you can also fix the values of the hidden units and generate sequences only in a particular cluster: for example, if you fix a positive value for hidden unit 4 and a negative value for hidden unit 3, you generate sequences with binding specificity of type 1, as in the sketch below. So in some sense we have been able to generate sequences with a given feature with respect to the multiple-sequence alignment. You can also make new combinations of the hidden variables to generate sequences with a new type of specificity, one which has not been seen so far in the alignment; I think there is just one natural sequence in that cluster. For the moment we have not tested these sequences, but you can use the log-probability to score them. Here are the natural sequences, and again I show the log-probability as a function of the distance to the closest natural sequence, with the different clusters corresponding to RBM-generated data of the different specificity classes. And you can also generate sequences at low temperature, as we did for the Boltzmann machine; in this case it is really easy: you just, for example, replicate the hidden units, so that you go from the probability of a sequence to its square, which is like sampling at temperature one half.
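A minimal sketch of the cluster-targeted generation just mentioned: alternating Gibbs sampling with some hidden units clamped, assuming the two conditional samplers of a trained RBM are given (all names hypothetical):

```python
import numpy as np

def generate_in_cluster(sample_h_given_v, sample_v_given_h, v0,
                        clamp={3: -1.0, 4: +1.0}, n_sweeps=1000):
    """Gibbs sampling with chosen hidden units held at fixed values.

    Clamping unit 3 negative and unit 4 positive steers the samples toward
    the type-1 specificity cluster described in the talk.
    """
    v = np.array(v0).copy()
    for _ in range(n_sweeps):
        h = sample_h_given_v(v)      # sequence -> representation
        for mu, val in clamp.items():
            h[mu] = val              # impose the chosen features
        v = sample_v_given_h(h)      # representation -> sequence
    return v
```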
So, as a conclusion, I have shown that Boltzmann machines can be useful tools to design functional sequences, and this has been experimentally tested and validated; in this case it is really important to use inference algorithms able to precisely reproduce the statistics of the multiple-sequence alignment. But Boltzmann machines are not able to identify the partition of the space of sequences into subfamilies, that is, to extract the structure of this sequence space. To do this we have used restricted Boltzmann machines, and we have seen that under specific conditions, such as weight sparsity and non-linear hidden-unit potentials, this machine can learn a compositional representation of the data and achieve a very good trade-off between interpretability and performance: you can really see in the weights the patterns which are important for the data set. And you can use the RBM to generate sequences with given specific properties within the family, that is, in given clusters. We are now looking for experimental validation of these RBM designs as well.