Our speaker today is Guillaume Lample. He studied mathematics at the École Polytechnique in Paris and then moved on to Carnegie Mellon to do a master's degree in natural language processing. He moved back to Paris to do his PhD, which he finished in 2019, and he is now a research scientist at Facebook AI Research. A lot of his work centers around unsupervised machine translation, and he has been awarded a best paper award for his work on this topic at one of the major conferences in the field. The topic of his talk today is not going to be language processing or machine translation per se, but rather what understanding this topic can teach us when it comes to manipulating symbols in mathematics. So on that note, please Guillaume, the stage is yours. Hi, thank you for inviting me. So I'm Guillaume Lample, and today I will talk about deep learning for symbolic mathematics. It's some work that we recently published at the ICLR conference. So the motivation behind this talk is that if you look at current results in deep learning, neural networks perform extremely well on a wide variety of tasks. For instance, in computer vision, whenever you have to classify images, segment objects in an image, or recognize faces; in speech recognition; or in natural language processing, for instance in problems like machine translation — say, translating automatically from English to French — or classifying whether a sentence expresses a positive or a negative sentiment. For all these problems, whenever classification is involved, neural networks work really well, and they remain state of the art on most of the tasks where we are able to apply them. So what we wanted to see is whether it's possible to use neural networks in mathematics.
So people have tried, but the problem is that there has been very limited success in symbolic computation with deep learning so far. So surprisingly, what we have is that large neural networks can be pretty good at, let's say, translating English to French, but these models actually perform very poorly whenever they have to do some very simple arithmetic tasks — for instance, integer addition. Even if you ask a neural network to multiply two large numbers, say two numbers with five digits each, it will actually fail, or it will succeed but only maybe 80% of the time. So it doesn't work really well. And the thing is that most people in the field of deep learning for symbolic math have actually focused on these very simple arithmetic tasks, the reasoning being that if neural networks are not able to do addition and multiplication, then there is no hope that we can use them to perform complicated math, like proving theorems, computing integrals, or solving differential equations. But the thing is that humans are actually also not very good at this. If you give somebody two large five-digit numbers, it's very unlikely that they can compute the product in their head. Of course, people can do it if they write it down, but they cannot mentally compute it very quickly. Still, humans are actually pretty good at proving theorems and computing integrals. So maybe we have so far only tried to apply neural networks in math on the wrong problems. Maybe neural networks can be used in mathematics, but they will actually perform better on problems more complicated than addition and multiplication. So this is what we wanted to see in this paper: whether it's possible to apply neural networks to more elaborate tasks, knowing that they actually don't work well on simple tasks like multiplication.
So we focused on two different problems in mathematics: one is function integration and one is solving differential equations. The nice thing about these two problems is that they can be seen as a particular instantiation of a machine translation problem. But instead of, let's say, transforming an English sentence into a French translation, what you do here is basically map an expression — a function — to its integral. So you basically have to translate a problem into a solution. And neural networks perform really well on tasks like this: sequence-to-sequence models are very well studied and they perform very well, so we wanted to see whether they can be used for these problems. Another reason why we had some hope that neural networks might work well here is that it's also a problem that involves pattern recognition. For instance, if a human is trying to compute the integral of a function, usually what you try to do is recognize some specific patterns. For instance, if you see that there is an f prime times cosine of f, for a particular function f, in the function that you are trying to integrate, then it's likely that sine of f will be part of your primitive. If f is a small function, that is easy, but if f is a complex expression, it's going to be more difficult to notice these patterns. So it's really a pattern recognition problem, and neural networks actually excel at pattern recognition tasks: when you do image classification, you try to detect specific patterns in images, and if you want to classify sentences, you try to detect particular subsets of words. So here it's actually a little bit similar in that sense.
The difference, though, is that neural networks operate on sequences, so you cannot directly feed an equation to a sequence-to-sequence model. And another big requirement is that they need a lot of data. If you train a machine translation system with, let's say, thousands or tens of thousands of English-to-French translations, then your model will actually perform poorly. These neural networks work well, but the requirement for them to work well is that you need to train them on millions, or even hundreds of millions, of examples. For instance, in English-to-French translation we typically use something like 100 million sentence pairs — so you basically have a dataset of 100 million English sentences with the 100 million corresponding French translations — which you don't have for function integration and differential equations. So we have to address this. Okay, so the outline of the talk will be the following. First, we're going to see how we can make the problem compatible with sequence-to-sequence models — how we can actually use sequence-to-sequence models to do function integration or to solve differential equations. Then we're going to see what models we use, and also how we can create datasets, because there does not exist any dataset of hundreds of thousands or millions of integrals and differential equations, so we have to generate them ourselves in order to train the model on them. Then we're going to see the performance of the neural networks trained on these datasets, and how they compare with other solutions, like commercial software such as MATLAB or Mathematica, which have been used for many years and are well established and developed.
And finally, we're going to discuss something which is important in machine learning, which is generalization — basically the question: if we train a model on a particular dataset, do we have the guarantee that it will also work well on another dataset? And that is not guaranteed in machine learning. Okay, so the first problem is basically the question: how can we treat mathematics as a natural language? Because sequence-to-sequence models so far have mainly been used to translate sentences or to generate text, but here we work not with text but with mathematical expressions. So the way we do this is that we have to transform equations into a list of tokens. So for instance, we have one token for sine, one token for cosine, one for tangent — we basically have one token for every possible function that we want to use. We have one token for x, one token for pi. And the idea is that our neural network model will have one embedding — an embedding is essentially a vector — so we will have one vector for each of these tokens. And if we have a large number, say a number like 1414, the question is how do we give this number as input to a neural network? Here what we do is split it into individual digits, so we represent it as a sequence of five tokens: it's going to be plus, one, four, one, four. The idea is that since there is an unlimited number of integers, you cannot have one embedding per integer — if there are a billion numbers, you don't want to have a billion vectors. So you split these numbers into digits, and then you can represent any number using a sequence of tokens. And the other thing that we do — actually the first thing that we do — is that we take an expression and transform it into a tree.
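Before moving on to trees, the digit-splitting scheme just described can be sketched in a few lines (a minimal sketch; the exact token vocabulary, such as the sign token, is an assumption, not the paper's actual code):

```python
def tokenize_int(n):
    """Represent an integer as a sign token followed by one token per digit,
    so any number maps to a short sequence over a fixed, small vocabulary."""
    sign = '+' if n >= 0 else '-'
    return [sign] + list(str(abs(n)))

print(tokenize_int(1414))  # ['+', '1', '4', '1', '4']
```

This way the embedding table only needs entries for the ten digits and two signs, instead of one entry per possible integer.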
And then, from a tree, we can transform it into a sequence. So here — I mean, this is nothing new — we just take an expression and transform it into a binary-unary tree. For instance, if we have two plus three times (five plus two), then the root node is going to be a plus, and we add two to the multiplication of three with the addition of five plus two. So this expression is equivalent to this tree here. And if we have this other expression here, we can do something very similar. The difference here is that we also have a cosine, and this is not a binary but a unary operator. So we have a cosine node here that takes as input the multiplication of two with six. And if we have more complicated expressions — say you want to work with differential equations, so you want to represent the derivative, the differential operator, in your input — what we can do is have a tree where the root of some subtree is the differential operator. So here, if we want to represent this part of the expression, we basically have a tree of depth two: the root node is the differential operator, what we differentiate is on the left, and on the right is the variable with respect to which we differentiate what is on the left. So this expression here can be represented as this tree. So the first thing that we do is transform our expressions into trees. And once we have a tree, what we can do is simply enumerate it. We use prefix notation: we simply transform these trees into prefix sequences. For instance, with this expression here, we enumerate the tree starting from the root node: we have plus, then what is on the left — it's a two — and then we enumerate recursively what is on the right.
So it's going to be times, then three, then plus, then five, then two. We start from the root, then what is on the left, then what is on the right, and in this way we can enumerate this tree into a sequence. So to summarize: we take an expression, transform it into a tree, and then transform this tree into a sequence. And all along, there is a one-to-one mapping between an expression, the corresponding tree, and the corresponding prefix sequence. Since we work with sequence-to-sequence models, what we are going to do is give our models this sequence of tokens as input, and train them to predict another sequence of tokens. In the case of function integration, for instance, the input will be a function and the output will be the tokens of its integral. So this is going to be the input and output of our models. Now, as I was saying, if we want this model to perform well, we need very large datasets. Unfortunately, there do not exist very large datasets of functions with their integrals, or of differential equations with their solutions, so we need to find a way to generate these datasets. For integrals, we propose three different ways to do this. The first one is very simple; we call it the forward approach. The idea is that you can generate a random function f, and then use some existing framework like SymPy or Mathematica to compute its primitive F. Then you simply add the pair (f, F) to the training set as a training example. What you then do is feed the model f, and train it to predict the sequence of tokens that represents the function F. So this is the simplest approach to create a dataset like this.
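The expression → tree → prefix-sequence pipeline described above can be sketched using SymPy's own expression trees (a sketch; the choice of the operator's class name as its token is my assumption, and a full tokenizer would also split multi-digit constants into digits as described earlier):

```python
import sympy as sp

def to_prefix(expr):
    """Enumerate a SymPy expression tree in prefix order:
    the root token first, then each subtree, recursively."""
    if expr.args:  # internal node: an operator with children
        return [type(expr).__name__] + [t for a in expr.args for t in to_prefix(a)]
    return [str(expr)]  # leaf: a symbol or a number

x = sp.Symbol('x')
print(to_prefix(sp.cos(2 * x)))  # ['cos', 'Mul', '2', 'x']
```

Because the arity of every operator is fixed, the original tree (and hence the original expression) can be reconstructed unambiguously from the prefix sequence — this is the one-to-one mapping mentioned in the talk.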
The only problem is that — well, there are several problems. First of all, you need to rely on some external framework: you need to use Mathematica or SymPy or any other framework that can do integration. And because of that, it's unlikely that you will be able to perform better than these frameworks if you train your network on the data that they generate. It's also very slow: there are functions for which Mathematica will take thousands of seconds to compute the primitive, and sometimes it will actually just fail. There are many functions whose integral has a closed-form solution, but Mathematica will not be able to find it. So you will only generate a dataset composed of the subset of functions that Mathematica is able to integrate — you will not generate a perfect dataset. Here are some examples of functions that you generate with this approach. On the left you have what was randomly generated; then we ask Mathematica to compute the integral, and we get these functions on the right. What you can see is that the input tends to be very small and the output very large. We train the model to predict what is on the right given what is on the left — so again, it's a sort of machine translation problem where the input is this and the output is that. And what you can notice here is that the output is much longer than the input, which is not always the case, as we're going to see with the other datasets. The second approach to generate examples of functions with their integrals is what we call the backward approach. Here the difference is that we do not start from the function to integrate, but from the integral itself.
So we generate a random function, which we call F, and we can simply differentiate it to get the derivative f. Then we again add the pair (f, F) to the training set. The advantage of this approach is that it's extremely fast, and it never fails: when you have a function, you can always compute its derivative, assuming it's differentiable. The problem, though, is that you will tend to generate problems that are extremely long — the derivatives are going to be very, very long. Here are some examples of samples generated with this approach. We generated what is on the right — small functions — and if you compute the derivatives of these functions, you get what is on the left. So you can see that the inputs, what we are going to give to the model, are very long functions, and the outputs are very small: very small solutions to very long problems. The other problem is that if you take an expression like, for instance, the one we had before — x power three times log of x squared, power four — the integral of this function is this very long function here. If you generate functions randomly, it will essentially never happen that you generate this particular function: the probability of generating this exact function is almost zero. So with the backward approach you will never be able to add example pairs like this to your training set. You will only have functions with a very short output and a very long input; you will never have something like this, because the probability of generating a function like this is basically zero. This is why we have a third approach, which we call the integration by parts generator. The idea here is that we generate two random functions, F and G, and compute their derivatives.
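As an aside, the backward approach just described is only a few lines of SymPy (a sketch; `random_function` is a toy stand-in I made up for illustration, whereas the paper's actual sampler draws uniformly over binary-unary trees):

```python
import random
import sympy as sp

x = sp.Symbol('x')

def random_function(depth=3):
    """Toy stand-in for the paper's random tree generator:
    compose a few randomly chosen operators around x."""
    ops = [sp.sin, sp.cos, sp.exp, lambda e: e + x, lambda e: e * x]
    f = x
    for _ in range(depth):
        f = random.choice(ops)(f)
    return f

def backward_pair():
    """Backward approach: sample F, differentiate it to get f, and use
    (f, F) as a (problem, solution) pair -- fast, and it never fails."""
    F = random_function()
    return sp.diff(F, x), F

f, F = backward_pair()
assert sp.simplify(sp.diff(F, x) - f) == 0  # F is a primitive of f by construction
```

Note how the roles are reversed compared to the forward approach: no integrator is ever called, only differentiation.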
So f and g. And given the integration by parts formula, what we know is that if we know the primitive of f times G, then we can automatically infer the primitive of F times g — because F times G, we know: we know these functions, so we know their product. And if we know the primitive of f times G, because we have already computed it and it's already in our training set, then we can automatically infer the primitive of F times g. And vice versa: if we know the integral of F times g, we can infer the primitive of f times G. The advantage of this method is that it generates functions that you would not generate with the other generators. The problem is that it is quite slow. But the nice thing about the last two approaches — the backward approach and this one — is that you can generate very large datasets without relying on any external framework like Mathematica or SymPy; you can generate the data using very simple code and no external API or framework. Here are some examples of functions and primitives that you generate with this approach. We generated these pairs where the input is again on the left and the output on the right, and we train our models to generate what is on the right given what is on the left. Now you can see that examples like x power three times sine of x, which has this particular primitive, can appear in our training set, while the backward approach would never have given us this example — typically because it requires doing multiple integrations by parts, and this is why the output is very long. Okay, so far we worked with integrals, but we also wanted to see if our model is able to solve differential equations. Here we have the same problem.
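The integration by parts trick above can be sketched as follows (assuming SymPy for the symbolic manipulation; `ibp_pair` and the concrete example are illustrative, not from the paper's code):

```python
import sympy as sp

x = sp.Symbol('x')

def ibp_pair(F, G, int_fG):
    """Integration by parts: with f = F' and g = G', the identity
    int(F*g) = F*G - int(f*G) turns a known primitive of f*G into a
    new (problem, solution) pair for F*g, without calling an integrator."""
    g = sp.diff(G, x)
    return F * g, F * G - int_fG

# Example: F = x, G = -cos(x); then f*G = -cos(x), whose primitive
# -sin(x) would already be in the training set.
problem, solution = ibp_pair(x, -sp.cos(x), -sp.sin(x))
assert sp.simplify(sp.diff(solution, x) - problem) == 0
```

Here the generated pair is (x·sin(x), −x·cos(x) + sin(x)) — exactly the kind of example the backward generator would essentially never produce on its own.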
We need to generate a dataset of differential equations with their solutions. What we could do is generate a random differential equation and try to solve it, but the problem is that if you generate a random differential equation, you have no guarantee that a solution exists — and especially not a closed-form solution, something you could express with symbols. So we do something a little bit similar to the backward approach: we generate a random function, and then we look for the differential equation satisfied by this function. For a first-order differential equation, we know that there are many solutions, but they are basically all the same up to a constant c. So what we do is generate a random function: we generate a random tree, and when we populate the leaves of the tree, we put some x's and one constant c — just one constant c. Then we express the constant c in terms of everything else, that is, in terms of x and f of x. To do this, we basically solve for c. Here it's pretty simple: you can move the x to the left-hand side, exponentiate to remove the log, and if you put the x on the other side, you get that the constant c is equal to x times the exponential of f divided by x. Which means that this function is constant in x, because it's equal to a constant. So if you differentiate it with respect to x, it's equal to zero: this expression here is equal to zero. Now you can simplify this a little bit, because you know that the exponential of f divided by x is always nonzero. So if you remove this exponential and then multiply everything by x, you get the following differential equation: x y prime minus y plus x equals zero.
And then you can verify that this function is indeed a solution of this particular differential equation. So overall, you can generate random functions with one constant c, apply this process, and you get a differential equation satisfied by your random function. You can also do this at the second order. Here is another example for the second order, again with a simple input. Here we have not one but two constants, c1 and c2. We generate a random function with c1 and c2, and again we express one of the constants — c2 here — in terms of everything else, so in terms of f of x and c1: we get that c2 is equal to f of x times the exponential of x minus c1 times the exponential of two x. So this function here is constant in x, because it's again equal to a constant; if we differentiate it with respect to x, it will be equal to zero, so this big expression here is equal to zero. Similarly, we can then express c1, because we still have c1 there — we express c1 in terms of everything else, so c1 is equal to this expression here. And now, if we differentiate this with respect to x, we get the following formula: the exponential of minus x divided by two, times f prime prime minus f. And if we simplify by this factor, which is always nonzero, we get the following differential equation. And now we can verify that this differential equation indeed has this particular function as a solution. So what we can do is generate, say, 10 million random functions with two constants, apply this process, and we get 10 million differential equations with their associated solutions. And we can do this, again, without using any mathematical framework.
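The first-order construction above can be checked end-to-end in SymPy (a sketch reproducing the worked example from the talk, not the paper's generator):

```python
import sympy as sp

x, c = sp.symbols('x c', positive=True)

# The first-order example: start from the random function y = x*log(c/x).
# Solving for the constant gives c = x*exp(y/x); since a constant has zero
# derivative, differentiating that expression, dropping the (nonzero)
# exponential factor, and multiplying by x yields the ODE
#     x*y' - y + x = 0.
y = x * sp.log(c / x)
ode = x * sp.diff(y, x) - y + x

# The generated function solves its own ODE, for every value of c.
assert sp.simplify(ode) == 0
```

The point is that both the problem (the ODE) and its solution (the random function) come out of this process for free, using only differentiation and algebraic manipulation.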
Okay, so for the datasets in practice: for all of this, we start by generating a random function, and to generate a random expression we generate a random tree. We generate random trees with up to 15 internal nodes, and we populate each internal node with either a unary or a binary operator. For the binary operators we have addition, subtraction, multiplication, and division, and for the unary operators we have all these functions: exponential, log, square root, the trigonometric functions and their inverses, and the hyperbolic functions. And now we can look at how many expressions we can generate. Here you have different curves: the x-axis is the number of internal nodes of the tree, and the y-axis is the number of expressions — the number of associated trees. L is the number of possible leaves of the tree, p1 is the number of possible unary operators, and p2 is the number of possible binary operators. You can see that, for instance, if L is equal to one and p2 is equal to one, you get the number of binary trees — the Catalan numbers — for the different numbers of nodes. And you can see that if you also allow unary nodes, you have more possibilities. Now, if you allow yourself 11 possible leaves, 15 unary operators, and 4 binary operators, you get the blue curve, which is the setting that we use. And with 15 internal nodes, we basically have something like 10 to the power 45 possible expressions. So we have an almost unlimited number of expressions from which to generate datasets. And here, very quickly, is the number of examples that we generate for all of our datasets.
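The pure binary-tree case mentioned above (L = 1, p2 = 1) is exactly the Catalan numbers, and the general binary count follows by choosing an operator per internal node and a value per leaf; counting binary-unary trees requires a recurrence, which this simple sketch does not cover:

```python
from math import comb

def catalan(n):
    """Number of binary trees with n internal nodes."""
    return comb(2 * n, n) // (n + 1)

def num_binary_expressions(n, L, p2):
    """Binary-tree expressions with n internal nodes: a tree shape,
    one of p2 operators per internal node, one of L values per leaf
    (a binary tree with n internal nodes has n + 1 leaves)."""
    return catalan(n) * p2**n * L**(n + 1)

print([catalan(n) for n in range(1, 6)])  # [1, 2, 5, 14, 42]
print(num_binary_expressions(15, 11, 4))  # already astronomically large
```

Even restricted to binary operators, the count with 15 internal nodes is far beyond what any model could ever see during training, which is what makes the generated datasets effectively inexhaustible.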
For the forward, backward, and integration by parts generators, we generate 20, 40, and 20 million examples respectively, and for the differential equations we also generate 40 million examples of inputs and outputs. Sometimes we have very long expressions, with up to 500 tokens in the input of our differential equations. The examples that I show here are only very short ones, but the model is typically trained on extremely long examples. Okay, so now, in terms of results and how the model compares with other software. We compare the model with Mathematica, MATLAB, and Maple. For integration, we only compare on the backward generator: comparing on the forward generator would be possible, but it would not be very interesting, because Mathematica would get 100% on it — by definition, the forward generator produces the functions that Mathematica is able to integrate. So we evaluate the performance on the backward generator, and here we can see that Mathematica has around 84% accuracy, while our model is almost at 100% accuracy. The way we measure the accuracy of the model is as follows. We use a sequence-to-sequence model, which takes as input a function and proposes several candidates for the integral. It will say: here are the top 10 candidates that I think are solutions of this differential equation, or that I think are integrals. What we can do then — say we're doing integration — is take one output of the model and differentiate it, and if we retrieve the input, the function that we wanted to integrate, we know that this was actually a valid primitive, a valid integral of the input function. So here you have three sets of results, with beam sizes of 1, 10, and 50.
Beam size 50 means that we let the model generate 50 solutions, and the model scores a point if one of the 50 solutions is actually valid. With beam size 1, you just let the model generate one answer, and if it's incorrect, you don't let it generate another one. You can see that here the model has almost 100% accuracy. For differential equations, you can see that Mathematica has around 77% accuracy for the first order and 61% for the second order, while the model, if you allow it to make 10 guesses — to propose 10 solutions — will have around 94% or 73% accuracy. For differential equations, what we basically do is give the model a differential equation as input and ask it to propose a solution; then you can just take this solution, plug it into the input differential equation, and if it equals zero, it means this was actually a valid solution. So we can easily verify whether the model's outputs are correct or not. We still need a symbolic framework to verify the generations of the model, but the idea is that the model is very good at proposing candidate solutions, and then you can verify them with a symbolic framework. One other thing: here we use a timeout of 30 seconds for Mathematica, which means that if Mathematica doesn't find the solution within 30 seconds, we consider that it fails. In practice, we tried setting it to 10 minutes, but this doesn't really change the results: usually Mathematica either returns the solution instantly or just doesn't find it at all. So here are some examples of functions that Mathematica is not able to solve, but for which our model can find the solution.
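The verification step described above — differentiate each beam candidate and compare with the input — amounts to just a few lines (a sketch; the beam of candidates here is hypothetical, standing in for the model's actual outputs):

```python
import sympy as sp

x = sp.Symbol('x')

def first_valid_primitive(problem, candidates):
    """Beam-search verification for integration: a candidate is a valid
    primitive iff differentiating it recovers the input function."""
    for cand in candidates:
        if sp.simplify(sp.diff(cand, x) - problem) == 0:
            return cand
    return None

# Hypothetical beam of 3 candidates for the integral of x*cos(x);
# only the second one checks out.
beam = [sp.sin(x), x * sp.sin(x) + sp.cos(x), sp.exp(x)]
print(first_valid_primitive(x * sp.cos(x), beam))
```

For differential equations the check is analogous: substitute the candidate into the ODE and test whether the result simplifies to zero. Note the asymmetry being exploited: verifying a solution (differentiation, substitution) is cheap and reliable, while finding one (integration, solving) is the hard part left to the model.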
For instance, the first example here is a function to integrate — you have to compute the integral of this. Mathematica is not able to compute the primitive, but the model finds the solution almost instantly: it's basically the arcsine of this function here. Interestingly, if you give Mathematica not this, but a slightly refactored denominator — if you write it as one minus the square of this expression — then Mathematica recognizes that it has the form of the inverse sine, and it can find the solution. But if you expand the denominator, the way it is written right now, Mathematica just fails to compute the solution. Here you have an example of a first-order differential equation that our model solves and that Mathematica cannot, and here you have the same for a second-order differential equation. In this example, the model instantly finds the solution while Mathematica fails completely. So here are some other examples of generations, which I think are quite interesting. Here you have an input, which is a first-order differential equation, and the model, as I was saying, can propose different solutions. If we have a beam size of 10, it means the model proposes 10 different solutions; so here the model suggests 10 different outputs. Whenever the model proposes a solution, it also gives a score — typically a log probability, so a score of zero means the model was confident, with probability one, that this was the right hypothesis, the right solution. And interestingly, what you can see here is that all 10 different solutions proposed by the model are actually equal: all these 10 different expressions are exactly the same, except the last one.
But if you take a different value for the constant, you see that the last one is also exactly equal to the nine other expressions. So interestingly, the model here was able to find that these 10 different formulas are all equal, even though we never trained it to do this — we only trained it to compute the integrals of functions, but the model found on its own that these 10 different functions are all the same. Earlier in the talk, I mentioned that there were some differences between the generators — the forward, the backward, and the integration by parts — the difference being essentially that the expressions have very different lengths. Typically, the forward approach tends to generate functions with a small number of tokens in the input — so small input functions — but with a large number of tokens in the output: the integrals are very long, like the green curve here. The backward approach is exactly the opposite: we have very small outputs — very small integrals — but the inputs tend to be very long. And the integration by parts method is a bit in between: we have, at the same time, short inputs and short outputs. In machine learning, what happens is that if you train a model on a particular distribution and test it on another, very different distribution, it will tend to perform poorly. So this is what we also wanted to investigate here. What we can see on this slide is that we trained four different models: one on the forward dataset, one on the backward dataset, one on the backward plus the integration by parts dataset, and one on all three datasets combined. And for each of these four models, we evaluate the performance on the three datasets.
What we can see is that if we train on the forward dataset, we perform very well on the forward test set, in blue here, but very poorly on functions that come from the backward generator; we don't even reach 20% accuracy there. The backward model behaves symmetrically: trained on the backward dataset, it gets almost 100% accuracy on the backward test set, but only about 20% accuracy on the forward one. Now what we can do is concatenate the backward data with the integration-by-parts data. What is nice here is that these two datasets were generated without using Mathematica or any symbolic framework, just some very simple Python code. If you train a model on this combined dataset, you get almost perfect accuracy on those test sets, but you also get more than 50% accuracy on the forward data, which is basically the set of functions whose integrals Mathematica is able to compute. And finally, if you train a model on everything, the forward, the backward, and the integration-by-parts datasets, the model gets almost 100% accuracy everywhere. This means that if you know exactly the distribution of the functions you want to integrate, and you have a generator for that distribution and train a model on it, your model will perform very well on that distribution. The only issue is that we can never be sure that there does not exist, and there probably does exist, another distribution, different from these three datasets, on which the model would perform very poorly. Maybe there are expressions that are not generated by the forward, backward, or integration-by-parts approaches that this model will not be able to integrate either, which is basically one of the main issues with machine learning.
Yeah, so finally, some other results on generalization. Our forward dataset was generated with SymPy, which is like the Python equivalent of Mathematica, even if not as powerful, and we only trained the forward model on the subset of functions for which SymPy is able to compute the integral. Interestingly, what we found is that this model was actually able to integrate functions from the backward dataset that SymPy is not able to integrate. For instance, for these particular inputs with these solutions here, our model integrates them even though SymPy cannot, which is kind of surprising: in the end, the model was only trained on the subset of functions that SymPy can integrate, which means that in some way the model was able to generalize beyond SymPy's distribution. And okay, that's it for the talk. The paper is online, as well as the code to reproduce the results and the pre-trained models. I'm happy to take questions now.

Wow, cool. Thanks a lot, that was a great talk. So, questions. I would encourage people who have a question either to raise their hand, or in case they think their mic doesn't work, to type it in the chat and I will read it out for them. And I can see we have one question there. Louis, please go ahead.

I wanted to ask a question about generalization. If you have a particular expression you want to integrate, I guess you have no real way of knowing whether the training sets you used are similar enough to that function. So I guess you just have to differentiate the answer yourself and see if it works. Is that it?
Yeah, so basically, if we are given a new function, our model can propose solutions, but it will never know for sure whether the solutions it proposes are correct or not. The only way to check is to take the function, get a candidate solution, differentiate it, and see whether we retrieve the input function. If we do not retrieve the function we wanted to integrate, it means this was not a correct solution, and we can try another candidate. So this model is really meant not to replace formal systems, but to be used on top of them: the model proposes candidate solutions, and then a formal system verifies whether they are correct or not. You have no other choice.

The whole thing looked pretty impressive, but just one other question: you have this example where the 10 best solutions are all equivalent to each other, yet they have different scores. I guess the network doesn't know that they're equivalent to each other, is that right?

Yeah, exactly. Whenever you feed an expression like this as input, the model generates one embedding, a single vector of dimension 500, and it decodes the solution from that. Here it decodes all of these from that vector, which means there exists a space of dimension 500 in which all of these expressions have very similar representations; they are close in the vector space. But the model can suspect, yet never know for sure, that they are equivalent. In practice, though, it's what we observed: if two functions have very similar embedding representations, even when they are written very differently, it usually means they are equal.
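The verification loop described here, differentiate each candidate and compare against the input, can be sketched with SymPy (the problem and candidate list below are a toy example, not from the paper):

```python
import sympy as sp

x = sp.Symbol('x')

def verify_candidates(problem, candidates):
    """Return the first candidate whose derivative matches the problem."""
    for candidate in candidates:
        if sp.simplify(sp.diff(candidate, x) - problem) == 0:
            return candidate
    return None  # no beam hypothesis checked out

# Toy example: integrate 2*x*cos(x**2); a beam might propose several forms.
problem = 2 * x * sp.cos(x**2)
candidates = [sp.sin(x**2) + 1, sp.cos(x**2), sp.sin(x**2)]
print(verify_candidates(problem, candidates))
```

Note that the first and third candidates differ only by a constant, so both are valid antiderivatives; the check accepts the first one it reaches.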
But yeah, we can never know for sure without a formal system on top of this.

Anyway, it looks very impressive. Good talk. Thank you.

Okay, there's also a question in the chat by Peter, who asks whether you have considered the structure of the algorithms used by these symbolic packages like SymPy or Mathematica, and what first-year mathematicians learn, and whether you think this could help find better models that do the same job?

Yeah, so for SymPy, we didn't really dig too much into it; we basically took it as a kind of black box, a package to provide us with the dataset. What we tried to do was to see whether we could understand what the model is doing, for instance by looking at the attention heads. The problem with these models is that it's usually a bit hard to understand why they output this or that; they don't tell you why they produce one particular answer or another. But there are ways to investigate this. For instance, when the model predicts a token, you can look at what part of the input it is attending to: maybe the model starts outputting a cosine, and you look at where its attention is in the input just before it predicts that cosine. These methods were developed to analyze models in natural language processing, for instance when you want to understand why a model predicts a particular label for a sentence. We tried to apply the same techniques to understand what is happening in our model, but it is very difficult to interpret. We were not able to find any new algorithm or anything like that. The model is definitely doing something, but we are just not able to understand exactly what it is doing.
That's really something we would need to investigate a lot more to understand how it's working.

Amen, please.

Yeah, thank you. Very nice talk indeed. I would like to connect to the second part of Louis's question. Have you considered training the model on simple, equivalently reformulated terms? We see here this selection of terms which are equivalent to each other, but the model doesn't know it, and that's because the model hasn't been trained to know this. And this would be very cheap training: it's very inexpensive to reshape expressions in quasi-infinitely many different ways. If you were to do that, would there be some hope that the scores we see here would become much more similar, and that the model, in that sense, would start to know that these equations are the same?

Yeah, that's a good idea. If we were to do this, we could train the model to do two things: one is the task we have done here, solving a differential equation, and the second would be to map an expression to any other equivalent expression. Or we could do something at the embedding level: say we have an embedding for this formula and another embedding for that formula, two vectors; we could add a term to the loss that minimizes the distance between these two vectors. And actually, I think it would work.

This element on its own would already be an ingredient that could help existing packages like Mathematica improve their integration abilities, because as we have just seen, you put up an example of exactly that problem: Mathematica is able to integrate the same function only if it is written in a differently factorized way.
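The extra loss term suggested here, pulling together the embeddings of equivalent expressions, is essentially a contrastive-style penalty. A minimal numerical sketch of the added term; the embeddings below are made-up vectors, whereas in the real model they would come from the encoder:

```python
import numpy as np

def equivalence_penalty(emb_a, emb_b):
    """Squared L2 distance between the embeddings of two equivalent expressions.

    Adding this term to the training loss pushes equivalent formulas
    toward the same point in the embedding space.
    """
    return float(np.sum((emb_a - emb_b) ** 2))

rng = np.random.default_rng(0)
e1 = rng.normal(size=500)               # e.g. embedding of cos(2x)
e2 = e1 + 0.01 * rng.normal(size=500)   # e.g. embedding of 2*cos(x)**2 - 1

# Identical embeddings incur zero penalty; nearby ones a small penalty.
print(equivalence_penalty(e1, e1))  # 0.0
```

A full implementation would also need negative pairs (non-equivalent expressions pushed apart), otherwise the encoder can collapse all embeddings to a single point.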
So if there is a component that can simply re-represent functions in different ways, then using it in conjunction with your standard symbolic packages could already be a big step forward.

I think so, yeah. There are definitely some ways to get gains in Mathematica; I'm not sure exactly what people would be interested in here. But for instance, in the earlier example, Mathematica didn't work because the denominator was not factorized, and having a model that detects such expanded expressions and how they can be factorized is actually very simple for a neural network. So yeah, I think it would be possible to have a model that takes input functions and normalizes them, by factorizing or rewriting them in a way that is easier for a system like Mathematica to understand. These tools, Mathematica and neural networks, would work well together; I think it's definitely something that will happen at some point. The only issue is that neural networks are typically a bit expensive: it's better to have a graphics card, otherwise it's a bit slower. It's not that bad, but yeah. At some point, it's obvious to me that these tools will be integrated into Mathematica, because they will always be able to do things we cannot easily write by hand. We had a quick look at the SymPy code for integration, and it was really, really complicated. It's clear that a model you can train like a black box, and that performs so well, should be useful for all these problems.

Thank you.

Yep. Okay, Will, your question, please.

First, I take it you trained the numbers with a decimal representation, right? Yeah. Okay, so do you have any idea of the limits of the model's understanding of the decimal number system?
I mean, even in simple algebra, is there a limit up to which it can either multiply or divide?

Yeah, so I talked about this very quickly at the beginning. We initially had one way to represent numbers, but in the end, for the problems we used, we didn't have any float numbers; we only implemented that at the very beginning. There were other things we wanted to try. For instance: given some approximated, numerical version of an expression that can be written in closed form, can the model detect the closed form? Say you have some approximated function; can the model look at these floating-point numbers, detect that this definitely looks like a cosine of something, and map the floating-point representation to the associated closed-form formula? That's something we wanted to try, but we didn't really investigate it, so I'm not sure how it would work with floating-point numbers, because in the end we didn't use them for the integrals. I'm not sure, sorry.

Oh, that's a perfect answer. Do you have any idea whether this is good at just reordering? Have you tried it on more abstract reordering, where you entirely randomly make up the problem, with a set of rules by which the symbols can be reordered, and then train it on that?

So what we tried, and I don't know if it's exactly what you have in mind, is the following: if we have an expression like A plus B, where the model is able to integrate A and the model is able to integrate B, then automatically it should be able to integrate A plus B by just summing the integrals. But somehow it happens that the model is able to integrate A and A plus B, but not B plus A.
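This kind of robustness failure is easy to probe automatically: permute the terms of a sum and check that the solver still succeeds on both orderings. A sketch of such a check, using SymPy's own integrator as a stand-in for the neural model (the neural version would call the trained network instead):

```python
import sympy as sp

x = sp.Symbol('x')

def solver(expr):
    """Stand-in for the trained model: return an antiderivative, or None on failure."""
    result = sp.integrate(expr, x)
    return None if result.has(sp.Integral) else result

def commutativity_check(a, b):
    """Verify the solver handles both A + B and B + A."""
    ab = solver(sp.Add(a, b, evaluate=False))  # keep the term order as written
    ba = solver(sp.Add(b, a, evaluate=False))
    return ab is not None and ba is not None

# SymPy is order-invariant here; the talk reports the neural model
# occasionally failed on one ordering but not the other.
print(commutativity_check(sp.sin(x), x**2))
```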
It's not very common, but we found some examples like this, where things that should be automatically doable for the model are actually not: the model is not robust to this very simple modification. A plus B works, but somehow not B plus A.

Do you have any ideas about using a GAN architecture, with one model producing more difficult problems and another model solving them at the same time, or do you not know if that would work?

Yeah, I think it's a good idea. The only difficulty is that, as far as I know, we still don't really know how to make GANs work for sequences. People have tried for quite a while; I even spent quite a bit of time on this a few years ago. The problem is with differentiation: say your generator produces an expression F and the solver model is not able to solve it; you want to feed this back to the generator, but there is no continuous signal you can backpropagate through the discrete generation of tokens. The output is discrete, so you cannot differentiate with respect to it. GANs work well for images because an image is basically a matrix of float numbers between zero and one, so the gradient signal flows through it well; but we don't know how to backpropagate signals through discrete tokens. If we had a way to do this, I think it would work, but we don't know how to do it yet in deep learning.

Okay, thank you very much.

Can I ask what the structure of your network is?

Yeah, sorry, I didn't talk much about this. It's basically just a Transformer model.
There's this paper, "Attention Is All You Need", that presented the Transformer architecture a few years ago, and we basically use the vanilla implementation of this model: a six-layer model with dimension 500, so a pretty small model. Something interesting we found is that the model performs well even with a very small number of parameters. In natural language processing, for a model to work well you usually want something like 10 layers and dimension 1000, but in our case six layers, or even four layers of dimension 100, was already enough to get pretty good performance. This is nice because if you wanted to use this inside a symbolic framework like Mathematica, you would be able to plug in a very small but still very powerful model. But yeah, we just used the vanilla sequence-to-sequence model of the initial paper.

There's another question in the chat from Herschel, who is asking about your compute requirements: how many CPU or GPU hours did it take to reach that performance?

It's very fast, actually. You can train a model in a few hours, even on one GPU. I suspect that if you train it on your laptop GPU, it will also work after one or two days. That's one of the nice things about this dataset: you can train on it very quickly, unlike natural language processing where you really want eight GPUs. We did try to scale training to 32 or 64 GPUs, but the performance was basically the same as with just one GPU. So you can train this with a very small architecture and modest hardware requirements. The only thing that was a bit expensive was generating the dataset. With the backward method, you can generate the entire 40 million examples very, very quickly; it's super fast.
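To feed expressions into a vanilla seq2seq Transformer, the expression tree has to be serialized into a token sequence; the paper does this with prefix (Polish) notation. A minimal sketch of such a serializer for SymPy expressions (the token names are my own choices, not the paper's exact vocabulary):

```python
import sympy as sp

def to_prefix(expr):
    """Serialize a SymPy expression tree into a prefix token list."""
    if expr.is_Symbol or expr.is_Integer:
        return [str(expr)]
    op = type(expr).__name__.lower()  # e.g. 'add', 'mul', 'pow', 'sin'
    tokens = [op]
    for arg in expr.args:             # children in tree order
        tokens.extend(to_prefix(arg))
    return tokens

x = sp.Symbol('x')
print(to_prefix(sp.sin(x**2) + 3))
```

A real implementation must also handle n-ary operators (for instance by binarizing Add and Mul) so the token sequence can be parsed back into a unique tree.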
However, for the other datasets, it was a bit slow. For the forward approach, if you want to generate a training set of 20 million examples, you have to call SymPy (or Mathematica) at least 20 million times, plus all the additional calls where you feed an input function and the solver fails. So it's actually pretty long. The backward and integration-by-parts generators are much faster than the others. So I think the limitation, the difficult part, is generating the dataset; the rest is very inexpensive. But we released the dataset if people want to try running models on it.

Okay, thanks. I see no more questions, so maybe I'll ask one last one myself, and I apologize because it's perhaps a very open-ended question: do we have any intuition for how far this approach can be stretched? In other words, is it conceivable to have a model that you train on all the true statements you can generate starting from some axioms, and then expect it to have some intuition for whether a given theorem has a proof in that axiom system?

You mean theorem proving, basically? Yeah, that's exactly what we are working on right now. We are considering several systems, like HOL Light and Metamath. It's a bit more difficult to work with than this problem, because here we have an input, we generate one candidate solution, and it's done; it's either correct or not, a single step. For theorem proving, you have to do a kind of MCTS search: you have one goal, you apply a tactic, you get some subgoals, you try to solve the subgoals, and you do this recursively.
So you very quickly get a gigantic tree, and it's much more challenging and difficult than this problem. Also, the formal systems were developed, I think, without machine learning in mind at all, so it's very difficult to plug a model on top of something like Coq or HOL Light, typically. But that's still work in progress, so I don't have much to say currently.

Great. So, any last words, any last-minute questions? If not, then I would like to thank you very much again, Guillaume.

I've got one more question, sorry.

Sure, go ahead.

In terms of theorem proving, are you looking at more abstract algebra, or are you looking at so-called harder problems? I don't really know how to describe it very well, but yeah.

So for instance, in HOL Light you have the proof of the Kepler conjecture, so a lot of theorems, a lot of statements, a lot of definitions. We have about 40,000 formal proofs, which we split into subproofs, giving us about 40,000 proof trees, and we take a small subset of that as a test set, and that's what we work with. The issue is that it's very difficult to generate large datasets: these forward and backward techniques cannot be applied in HOL Light. But the mathematics it covers is pretty wide. There is not everything, but there is complex analysis, group theory, arithmetic, a lot of stuff. Somehow we work with this data without really understanding it ourselves; we just give it to the model and it figures out what to do with it. It's a bit more difficult to interpret.
Sometimes the model suggests a tactic, but you have no idea whether it's a good idea or not until you try it; we are not that familiar with the formal systems ourselves. So yeah.

Okay. Just one more. I don't know if you use a kind of word2vec preprocessing for your sequences, but if you do, do you find any kind of underlying manifold structure in the clustering of objects or symbols?

So we don't use word2vec, but yeah, there is some sort of manifold. It's related to these equivalent solutions here: we have 10 different solutions, and if you look at the locations of their embeddings in the vector space, they are actually very close to one another. Similarly, if you have something like the cosine of 2x, you'll have an embedding for that, and the other formula equal to it, two cosine squared of x minus one, will have an embedding which is also very close. So we observe that equivalent functions have embeddings which are close in the space. Word2vec would operate more at the token level, and if we plotted the embeddings at the token level, I think we would see that sine and cosine are nearby; but since there are not that many tokens, I don't know if there is much to get out of that. It's interesting for word embeddings because in English you have hundreds of thousands of words, so you can observe things, but in our case, at the token level, I don't know; it's interesting, yeah.

Okay, thank you very much. Thanks.

Okay, there's one last question by Yangui, who is asking about a recent collaborative effort called the Ramanujan Machine, where they try to generate new identities, continued fractions and so forth.
He asks whether you have any idea how this connects to your work, and whether you could use models or approaches like yours to generate new identities?

Yeah, so I'm not entirely sure, but I believe the Ramanujan Machine was more of a brute-force kind of approach; I forget whether it used machine learning or not. I don't really know the paper very well, so I don't have an interesting answer on that. But if you wanted to find identities, there are things you could try with these approaches. For instance, you could generate a random formula, say exponential of square root of N divided by N plus one or something like this, evaluate it for a bunch of different values of N, and then train a model to retrieve the original formula given its evaluations at different points. There is a story like this with Ramanujan: for the number of partitions of an integer, the evaluations at different points were known, but the exact formula was not, and he could form an intuition about the formula and just write it down without having a proof. You could maybe try something similar with a network: generate random formulas, evaluate them at some random points, and see if the model can retrieve them. It's something we briefly considered but didn't really have time to investigate further.

Okay, wonderful, that was it. Thanks so much again, Guillaume, that was really great, and thanks a lot to everybody who asked questions. That's it for today; see you next week, bye.

Thanks, thank you very much. Thanks.
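The idea sketched in that answer, learning to map numerical evaluations back to a symbolic formula, starts from a simple data-generation step: evaluate a known expression at a few points and pair the values with the expression's serialized form. A toy version with SymPy (the sample points and formula are arbitrary choices for illustration):

```python
import sympy as sp

n = sp.Symbol('n')

def evaluation_example(formula, points):
    """Build one (input, target) pair: numeric evaluations -> formula string.

    A model trained on many such pairs would be asked to recover the
    formula from the numbers alone.
    """
    values = [float(formula.subs(n, p)) for p in points]
    return values, sp.srepr(formula)

# The kind of expression mentioned in the talk: exp(sqrt(n)) / (n + 1).
formula = sp.exp(sp.sqrt(n)) / (n + 1)
values, target = evaluation_example(formula, [1, 2, 3, 4])
print(len(values))  # 4
```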
Bye everybody and thanks for the talk. Thank you.