Hello everyone, welcome back to those of you who are part of Youth in High Dimensions, and welcome to the others for this very special lecture from Michael Douglas, who is part of the Center of Mathematical Sciences and Applications at Harvard and of Stony Brook University. Michael is originally a string theorist, if I'm not mistaken, but after some time he became very interested in machine learning, which is, I think, right on topic for the present SMR, so we thought it would be nice to combine the two. Michael will approach a pretty timely question: how will we do mathematics in 2030? I'm very keen to hear about that. Thank you, Michael, for being here, and please, you can start.

Thanks, John. Thanks, everyone. It's a pleasure to join you at this exciting conference, and always a pleasure to visit the ICTP. I'll be speaking about the title, giving an overview of the ways that computers have changed and will change the way we do not just mathematics, but the mathematical sciences. I've been giving this talk since late 2019, and it has evolved. In fact, progress has clearly accelerated; things have gone faster than what I predicted. So I've changed the talk a little since the beginning, but pretty much everything I said has held up. I'm going to start by giving a lightning history of computer science, because to understand the developments I think you have to put them in some perspective, and then extrapolate from that, and of course from all the amazing progress going on now.

So, the history of computer science. There were earlier ideas, but the real history begins just before, during, and after World War II, and in particular with the first electronic computers, which were very specialized devices. People wanted to calculate trajectories and design atom bombs and such, but very quickly the pioneers of computer science saw that these were much more general devices, and that you could imagine doing all sorts of things and understanding many subjects in a new way. A lot of what we have today is built on these original insights. The idea of a worldwide database that would connect everybody to the store of knowledge goes back to the 1940s and 50s, to Vannevar Bush. Cybernetics, the study of control systems, a prominent subject that has gone by many names, also started then: controlling systems using feedback, and not just simple feedback loops. The concept of artificial intelligence goes back to a famous meeting in 1956 where that term was proposed, and I'll get into that. And there is a family of ideas, under a term less used today, that brings together approaches to computing based on phenomena in other areas of science: Rosenblatt famously proposed the perceptron model, inspired by the neurons of the human brain, and Holland was one of the earliest workers on genetic algorithms. So these founding ideas go way back, and then of course computing very quickly had an impact, and I'll get into some examples. The ability to do simulation already had a strong impact. And among these early ideas of artificial intelligence was automated theorem proving: one of the first artificial intelligence projects, by Newell and Simon, was a theorem prover that would do what its name says.
It didn't take long to realize that, although as a matter of philosophy or even of mathematical logic you can write an algorithm that enumerates all proofs, and thus proves all provable statements, such an algorithm takes exponential time: finding a proof of length L takes time exponential in L. One can try to optimize that, and a lot of the story of early AI is search: how do you search through a tree of moves of a game, or of steps in a proof? But that was a difficulty on which it took a long time to make progress, and even doubling the depth of the tree you can search doesn't get you very far. Still, many of the ideas we use now, the early symbolic manipulation systems for example, go back to this time, to the sixties.

Okay, and then if we're going to make a very broad history and point out the most significant developments, I think the next one that belongs in that foundational category is the internet. That has hugely changed the way everybody lives, but in particular, and this is the topic of this talk, it changed how scientists and mathematicians do research. The story I tell from my own experience to illustrate this is string theory; the ICTP is a very active place in string theory. When I was a grad student studying string theory at Caltech in the 80s, it was in some ways a frustrating place to be. It was what we now call the first superstring revolution, and it was very, very exciting: Calabi-Yau compactifications, the heterotic string. The grad students at Caltech, where I was, worked hard to keep up by reading preprints. How did you find out about the latest results? You would occasionally get a box in the mail full of printed-out papers, preprints, coming from the different institutes and universities doing the work. We would eagerly open these boxes, read the preprints, absorb the ideas, come up with our own ideas and hypotheses, start working on them and make some progress. And then the next box would come, and our ideas would already have been anticipated, and we would see papers about them, and it was frustrating. The progress was very much centered at Princeton and the Institute for Advanced Study, and to some extent a few other places like Harvard, so it was very largely concentrated in one geographical area.

That was the mid 80s. The arXiv started in 1991, and then in 1993 there was what we now call the second superstring revolution, with duality, black hole state counting, and eventually things like gauge-gravity duality. And you really can point to the person who started it: Ashoke Sen, working in India, in 1993. The subsequent development was very different from the way things had gone in the first superstring revolution, because Sen stayed at the forefront, and many people were at the forefront; the research was really dispersed around the world. I remember in the mid 90s a paper came out of, I think it was the Czech Republic, by somebody we'd never heard of, and it was a really pretty good paper on exactly the latest ideas. It was Luboš Motl, working more or less alone. And there are many, many stories of people who could catch up and stay current thanks to the internet. So it very much changed the way people worked and very much democratized cutting-edge research around the world.
We could go through a lot of examples of how this kind of communication affects research, the way we discuss, the way we think. But if you had to pick one example that not only was a great success but surprised everybody, in that when it started people did not think it would work at all, it is Wikipedia. For much of this audience it's hard to imagine that Wikipedia didn't exist at some point. It was created and begun in the early 2000s, and the quick expectation was that this thing was just going to break down: if you let everybody edit an article, how can it ever maintain any standard of quality? Doesn't somebody have to look it over and supervise? And of course somebody does have to look it over and supervise; there are chief editors in Wikipedia, and a kind of hierarchy and discussion. But it requires much less work than the previous method, and it kind of works. Again, I wouldn't necessarily trust Wikipedia, although it's much more trustworthy than ChatGPT, I must say. But as a starting point for understanding something new, it is really the best that we have at the moment.

Okay, so now we're in this third technological leap, of machine learning. Since 2012, I think everybody would agree this is of equal importance to those previous steps forward. When I started giving this talk, it was a period when Google and the others were hiring thousands of PhDs of all sorts, physicists in particular, mathematicians, and so forth. We've since gone through a whole cycle of hiring, less hiring, and firing, but the overall growth of this profession and this style of work, and its impact, continues. And so the central topic of this talk, how will we do mathematics, the mathematical sciences, physics, and the rest in 2030? The first thing to point out, of course, is how these developments in machine learning and AI will change the way we do things, just as those previous developments did.

Okay, so that's the bit of history and setting. Now, there's this broad area of computational mathematics, which is not new at all and certainly long predates electronic computers. Famously, Ramanujan was an unbelievable genius at coming up with formulas by computation and by pure intuition. That is a very live and important tradition, and computers add to it. And you can point to specific results, but also to whole subjects in mathematics and physics, that really became possible because of computation. One could debate the choices, but I think the best example is dynamical systems. The defining concepts are of course very old, and people worked on celestial mechanics, the dynamics of the solar system, centuries before. But it's just too hard: the intricacy of non-integrable mechanical systems is just too much to do entirely with pen and paper. A lot of the properties were discovered, and eventually turned into conjectures and results, through computer simulation of ODEs and PDEs and the like. The same holds in many other areas of math; discrete subjects tend to be easier. If you look at the classification of finite simple groups, the complicated cases, such as the Monster, relied very much on computer calculations to complete.
The Birch and Swinnerton-Dyer conjecture, which I'll talk about a little more later, was based on calculations done in the 50s and 60s by its originators. The Simons Collaboration on arithmetic geometry is a living descendant of this kind of project. So that's an area of the mathematical sciences where we can try to make extrapolations for 2030. Clearly computers get faster and cheaper. The original form of Moore's Law, that the transistors get smaller, has kept going until more or less now; Moore's Law has been declared dead many times, but we are now down to three-nanometer features, and that is close to the limit. Still, even if the actual components on a chip don't get much smaller, the tools for packing more of them into one space and running things in parallel are continually improved. So it is a reasonable simplification to say that every year the available computation for any given class of problems doubles. And if we're within a factor of 1,000 in computer time and resources of solving a problem now, then that problem will be solved simply by doing what we're doing now with computers a factor of 1,000 better. That's an easy prediction.

Okay, I'll go through a few other examples. I'll go through this one fairly quickly, but it's interesting, so it takes a little time. There are many, many techniques which are not machine learning, and maybe that's the main point this example illustrates, that are already of great value to industry and to researchers in many subjects, and that can be applied to mathematical problems. A good example is what's called the Boolean Pythagorean triples problem. You take the positive integers from one up to some n, and the question is: can you color them, or equivalently partition them into two subsets, such that neither subset contains a Pythagorean triple, a² + b² = c²? It's interesting that you can estimate the likely n at which this first fails. But the intricacy of the solution is such that even writing down the coloring and verifying it for some valid n in the thousands is going to take quite a while. It was proven to be possible for n = 7,824 and impossible for n = 7,825 by Heule et al. in 2016.

How did they do that? This is a problem that you can fairly directly encode into propositional logic. You have n variables, labeled x1 through xn, and we say that if x1 is false, then 1 is colored red; if x1 is true, 1 is colored blue; and so on for the others. Then each Pythagorean triple turns into a pair of logical clauses stating that for that triple, the three variables don't all take the same value; that's straightforward. And then you can take this long system of roughly a million clauses and feed it to what's called a SAT solver, a program which just looks for solutions of systems of Boolean constraints, or, the somewhat harder part, can generate a proof, a certificate, that there is no solution. These turn out to be very, very efficient, so that one can do this calculation in a day or two on a reasonably sized computer; people in industry deal with much bigger systems than this. And the proof that no coloring exists for the larger n is something like 200 terabytes long.
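To make that encoding concrete, here is a minimal sketch in Python, my own illustration rather than the actual tooling Heule et al. used, which writes the clauses in the standard DIMACS format read by off-the-shelf SAT solvers. The variable numbering and the two clauses per triple follow the description above; the file name and choice of n are just for illustration.

```python
# Illustrative sketch (not the Heule et al. code): encode the Boolean
# Pythagorean triples problem for 1..n as a CNF formula in DIMACS format,
# which any off-the-shelf SAT solver can then try to satisfy or refute.
import math

def pythagorean_triple_clauses(n):
    """Yield clauses forbidding monochromatic Pythagorean triples.

    Variable i (1-based) is true if integer i is colored blue, false if red.
    For each triple a^2 + b^2 = c^2 with c <= n we forbid "all blue"
    (clause: -a or -b or -c) and "all red" (clause: a or b or c).
    """
    for a in range(1, n + 1):
        for b in range(a, n + 1):
            c = math.isqrt(a * a + b * b)
            if c <= n and c * c == a * a + b * b:
                yield [-a, -b, -c]
                yield [a, b, c]

def write_dimacs(n, path):
    clauses = list(pythagorean_triple_clauses(n))
    with open(path, "w") as f:
        f.write(f"p cnf {n} {len(clauses)}\n")
        for clause in clauses:
            f.write(" ".join(map(str, clause)) + " 0\n")

if __name__ == "__main__":
    # Generating the clauses for n = 7825 takes a minute or two in pure Python;
    # it is refuting the resulting formula that required serious compute.
    write_dimacs(7825, "pythagorean_7825.cnf")
```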
In terms of understanding: in some rough sense you could probably estimate that critical n, and it would be interesting to have an argument that gives a good estimate without all that work. That is probably the extent of the understanding in this case; the details are intricate. There are extensions of this approach that can solve somewhat more, and these intricate problems are exactly what they are good at, such as what's called SAT plus CAS; there is a group at Waterloo in particular that specializes in this. You combine the SAT solver with symbolic algebra, which generates the initial clauses from whatever equations or mathematical structure you start with. An example of a problem they made progress on is the so-called Williamson conjecture: can you find four symmetric n-by-n matrices, all of whose entries are plus or minus one, such that the sum of their four squares is proportional to the identity? They exist for every dimension below 35, and for every even dimension up to 70, but are proven not to exist for 35. Why is that? Well, maybe there's some answer. Anyway, this is a very interesting source of results and potential conjectures.

Okay, so machine learning and neural networks. In early versions of this talk I could just say this is obviously important; now I think one has to say more. This is a plot of year versus the log of, basically, the amount of training compute put into developing a model and writing a paper, measured in petaflop/s-days (petaflops per second, times days). This is the history of these machine learning problems, and the basic feature of the graph is the inflection point in 2012. That is when AlexNet, the great advance in computer vision, was made, and ever since, people have been doubling the amount of computer time spent on machine learning every 3.4 months. This plot goes up to 2020, and I suspect that if you continued it, it would show a further increase right around 2022, with the language models.

Okay, people here all know this, but I'll say it because not everybody is at the conference: these are the three standard paradigms of machine learning. You could be given a data set whose input might be the position in one of these plots, giving two input features or variables, together with a target to predict, such as blue versus green: supervised learning. You might be given nothing but the input, and you're trying to either cluster or estimate a probability distribution: unsupervised learning. Or you could be in the somewhat different paradigm where there is something you're trying to predict and optimize, but you typically only get the signal much later, the classic example being that you play a game, and at the end you find out whether you won or lost, but you have to somehow figure out which of the moves you made led to winning or losing: reinforcement learning. So there are these three general paradigms, and they all have their applications. The slides will be made available afterwards; they contain various relevant comments which I will skip in the interest of time.

And then, if you're applying this, again there are many ways you can apply it, but if we're going to talk about pure mathematics and theoretical physics, I think the primary application is actually to synthetic data. Obviously we're not building an apparatus and doing measurements.
If I'm an astrophysicist or an astronomer, of course, that is what I'm doing. But if I'm a number theorist, it's not, and you might wonder. Yet of course there is a huge amount of data in my subject; this is what people like Gauss were filling up notebooks with, and that data is the primary resource now for the computer to look for patterns in, to do machine learning on. And in physics we know the laws very well in all normal situations, we have the Standard Model and the many effective theories it leads to, so we have very powerful abilities to simulate and get synthetic data that way. The properties of the simulations themselves are non-trivial, and there's a great deal of structure there in which one can look for patterns.

An example of a mathematical data set would be a list of knots. There's an infinite list, and if it's infinite you can only sort of call it a data set; you certainly can't give it to a computer. What you can give, in principle, to a computer is either a concrete finite list of knots or, to be a little more general, a probability distribution over knots. This particular data set was used by Davies and collaborators about two years ago to predict, and then prove, a new relation between invariants in knot theory. It was all the knots up to 16 crossings, and then, for each knot, a set of invariants. There are algebraic invariants, the Jones polynomial being an example that physicists certainly know about: there is some rule where you draw the knot, manipulate it, cross strands over each other, and compute the invariant. And there are geometric invariants. The general story of those is not obvious: if you take a three-sphere and excise, cut out, the knot, you get a three-manifold with boundary, and you can put a natural hyperbolic metric, a metric of constant negative curvature, on it. You get this natural hyperbolic space of finite volume associated with the knot. The curvature is minus one, so that fixes the scale, and the volume is an invariant; there are many other invariants that you can base on that geometry. You can also put a gauge theory on it, Chern-Simons gauge theory, and define an invariant.

What Davies and collaborators, together with Marc Lackenby, were able to prove is a bound on the signature, basically, in terms of these geometric invariants. And how did they do this? They did it by data science, basically. They took this big table and asked, for each column of the table, can we fit that column as a function of the others, say the signature as a function of the geometric invariants? Feed it to a general neural network and see if we can get a fit. There were various candidates, in particular the signature depending on some list of invariants here. Then they used clever attribution analysis to ask which of these features were really controlling the fit and which ones were secondary. This is a paper one can learn quite a bit from in terms of real data science, but applied to what I like to call a platonic data set: it is data, but somebody off in the Andromeda galaxy could construct the same data set, just from the definition of a knot.
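Staying with the knot table for a moment: as a purely illustrative sketch of that fit-then-attribute workflow, here is roughly what it looks like in code, using synthetic stand-in data rather than the real knot invariants, a generic scikit-learn network rather than the architecture of the paper, and permutation importance as a crude stand-in for their attribution analysis.

```python
# Illustrative sketch of the fit-then-attribute workflow on synthetic data:
# fit one "invariant" (playing the role of the signature) as a function of
# the others with a generic neural network, then ask which input features
# the fit actually depends on.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_knots, n_features = 5000, 6
X = rng.normal(size=(n_knots, n_features))   # stand-in "geometric invariants"
y = 2.0 * X[:, 0] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=n_knots)  # stand-in "signature"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X_tr, y_tr)
print("held-out R^2:", model.score(X_te, y_te))

# Crude attribution: how much does shuffling each feature hurt the fit?
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i, score in enumerate(imp.importances_mean):
    print(f"feature {i}: importance {score:.3f}")
```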
There is also something called the LMFDB, the L-functions and Modular Forms Database, that some of you will know about; it includes modular forms, it includes elliptic curves. Again, I don't think I'll go into detail, but this is the background for the Birch and Swinnerton-Dyer conjecture, which was in a way an early example of this machine learning paradigm. It turns out to be something you can see by linear regression: it's basically a prediction in terms of a particular product, and if you take logs you can do linear regression. But to do it with the computers available in the 50s and 60s was quite a tour de force.

Okay, here's an example with physical simulations: celestial mechanics, a very classic problem, and one that simulation largely answered; in fact there continues to be progress on it. In some sense the answer to the age-old question, is the solar system stable, was more or less found in 2009 by simulations like those of Laskar and collaborators. And in general it's not stable. What that means, because it's a chaotic system, is that if you wait long enough, eventually some planet gets ejected; or equivalently, if you run for some billions of years but slightly vary the initial conditions, about 1% of those runs lead to ejecting a planet. There's actually an interesting relative stability there, because the Lyapunov time for the inner planets is a few million years, so why it should take billions of years to eject a planet is also non-trivial. Anyway, that system has been much studied, but you can of course generalize the problem. This has become very interesting to astronomers who discover and study exoplanetary systems. So you can ask the general question: suppose there are n planets and I know their masses and their initial conditions. Is the system stable, and if not, what is the expected timescale of instability?

That's obviously a very intricate problem, and there is a straightforward way to try to solve it with machine learning: you do lots and lots of simulations, run them long enough to see whether the system is stable or observe instability at some time scale t, and then try to predict t from the initial data. If you do it that way, it doesn't work, and it's kind of obvious that it's not going to work, because the system is chaotic and the outcome depends on very, very precise details of the initial conditions. But the thing that does work is to run the simulation for very much less time. You want to predict instability on the scale of a billion years, but you only run it for, say, 10,000 years. Now you record the orbital elements every orbital period, every "year", and combine those into features, so now you are exploring the phase space. They built various summary statistics, which is the point at which it's not pure machine learning (although later work did it in a more purely learned way): they used known ideas about the problem to construct summary statistics, fed those to the machine learning, and the program predicted the border between systems that are stable and those that are unstable on the billion-year timescale. So that's a nice example of combining simulation, which is exact, some physical knowledge about the problem, and then really machine learning to produce the hard, intricate, actual answer to the question.

Okay, now I'm going to skip a bit; it's just too much for one talk, and I can maybe come back to it in questions. This is a very nice application of graph neural networks to cosmology from a Princeton and Flatiron Institute group, who took very large cosmological simulations.
These were expensive supercomputer simulations, but they came up with relatively simple fits to predict, say, the density of dark matter, or corrections to the force. I should say the simulations are with dark matter, and the goal is to get some sort of simple description which in a sense integrates out, or takes into account, the dark matter. From a technical point of view, what's interesting is that the graph neural network is there to take into account all the pairwise interactions, and the machine learning does the fit. Then there's a second step of symbolic regression: you take a general numerical relation between x and y, such as the one the neural network has fit, and look for a symbolic expression that fits that numerical relation well. Often that comes out as a relatively simple formula that you can then interpret and use.

Another example of this was work with an economics colleague at Harvard, on something called the gravity model of trade in economics. What's that? You look at the trade between pairs of countries, the imports and the exports, and you make a big table. And you can fit the amount of trade between any pair of countries really pretty well as the size, the GDP, of the first country, times the GDP of the second country, divided by the geographical distance between the countries, and you see why it's called the gravity model. This fits the data surprisingly well; by economic standards it's really pretty good for such a simple model. So there has been quite a lot of work on this. It goes back to the 1960s, and in particular there are models with higher-order, three-country or multi-country effects, like if one country is very good at making a certain thing, that will change the patterns, and you can take that into account. We just did it with a graph neural network and a symbolic fit, and we got a model as good as any, with some real similarity to those handcrafted models that had higher-order effects in them. So that is another example of this approach.

I'm just going to make this point very quickly. This is statistics, very relevant to this conference, and I feel that the rest of mathematics is being more and more influenced by statistics in many ways. One example is that instead of handcrafting a model, you can use a general approach, and the example I give of that is data analysis at a collider. This is a detector at the LHC, and there are these complicated events with thousands of particles coming out. Theory tells you that these come from some small, hard collision, with maybe a couple of quarks coming out, which give rise to jets with hundreds of particles, and you have to identify the jets as the first step of the analysis. People developed a variety of handcrafted ways of doing that. But a relatively straightforward way that works just as well, if not better, is this: you imagine the interaction region surrounded by a cylinder, which it is, the detector, and you characterize the event by how much energy was deposited in each pixel, each little region, of this cylinder. Then you can define a distance between events as the Wasserstein distance: the minimal total movement of energy needed to take one event's energy distribution and turn it into the other's. That is the formula here, which in this audience people know.
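As a hedged, one-dimensional toy of that idea (real analyses work with the full two-dimensional cylinder and the proper energy mover's distance; this just treats each event as an energy distribution over pixels along a single detector coordinate), one can compute the pairwise Wasserstein distances with standard tools:

```python
# Toy version of the event-distance idea: each event is a normalized energy
# distribution over pixels along one detector coordinate, and the distance
# between events is the 1D Wasserstein distance, i.e. the minimal total
# energy movement needed to turn one distribution into the other.
import numpy as np
from scipy.stats import wasserstein_distance

n_pixels = 64
coords = np.arange(n_pixels)  # pixel positions along the coordinate

def random_event(rng, n_jets=2):
    """Toy event: a few narrow 'jets' of energy deposited in pixels."""
    energy = np.zeros(n_pixels)
    for _ in range(n_jets):
        center = rng.integers(0, n_pixels)
        energy += rng.uniform(10, 100) * np.exp(-0.5 * ((coords - center) / 2.0) ** 2)
    return energy / energy.sum()  # normalize so events are comparable

rng = np.random.default_rng(1)
events = [random_event(rng) for _ in range(5)]

# Pairwise distance matrix between events; this matrix can then be handed
# to any clustering routine.
D = np.array([[wasserstein_distance(coords, coords, u_weights=a, v_weights=b)
               for b in events] for a in events])
print(np.round(D, 3))
```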
And then you do clustering with respect to that distance. That turns out to be very similar to one of those handcrafted algorithms, but it's a general technique. So general approaches win, both because we know so much more and because our computers are better, so we can get away with using them: a lot of problems map onto each other, and you don't need all the specific inventions of each particular field.

Probabilistic models in number theory: that's an interesting topic. The example I give here is not machine learning, but it is a probabilistic model, a simple model of elliptic curves, and it predicts the distribution of the ranks of elliptic curves. This is from 2012, 2013. One can do various things with it, but this model in particular strongly suggests, if you believe it, that once you get past rank 21 there should be only a finite number of curves of higher rank. So it's an open question, but this gives a larger picture in which one can try to make such statements.

OK, so now I'm going to switch gears and, again going fast, tell you something about interactive theorem proving. If there are questions specifically about these topics, we can save them for the end. OK, I'll keep going. Going back now to the 50s and 60s and that first theorem prover: this led to the concept of automatic theorem proving. In principle it's AI, but it's also just something one can try to program a computer to do, and it produced a whole branch of computer science called formal methods, which is quite useful when you have software that has to be as reliable as possible, like an airplane autopilot. An important example: the first Pentium processors that Intel made in the mid-90s implemented the, at that time, new floating point standard, and they made a mistake in a few boundary cases, things that hardly make any difference. So they kind of said, ah, who will care about that? But people were so distressed, both that it was not producing guaranteed correct results and that when you checked your software you now had to worry about this all the time, that Intel wound up recalling the chips, and it cost them more than a billion dollars. And now they verify their designs, in particular the floating point unit in every chip design, using the technology I'm about to describe.

Here is an example, probably a little hard to read in the back, of how this is used in software development. This is the Coq theorem proving language, and this is the definition, in that language, of what it means to be a sorting algorithm. A sorting algorithm takes a list over some ordered set, here we'll say natural numbers, to another list, such that the new list is a permutation of the old list and the new list is sorted. And then there are pretty straightforward logical definitions of all of those concepts. What it means for a list to be sorted: the order relations have to be compatible. What it means to be a permutation: there's a nice recursive definition, where you can prepend the same element to two lists related by a permutation, you can prepend two elements in transposed order, and the relation is transitive. And then it's a topic taught in courses and textbooks: take a program, like an intricate quicksort program to sort a list, and use the theorem proving language to prove that it really satisfies this specification.
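To give the flavor in the system I'll turn to next, here is a hedged Lean 4 analogue of that kind of specification. This is my own minimal sketch, not the Coq code from the slide, and it assumes `List.Perm` is available, for instance via Mathlib.

```lean
-- Sketch of a sorting specification in Lean 4 (illustrative, not from the talk).
-- `Sorted` says consecutive elements are in order; `IsSortingFunction f` says
-- f's output is sorted and a permutation of its input.
-- `List.Perm` is assumed available (e.g. via Mathlib).
def Sorted : List Nat → Prop
  | []          => True
  | [_]         => True
  | a :: b :: t => a ≤ b ∧ Sorted (b :: t)

def IsSortingFunction (f : List Nat → List Nat) : Prop :=
  ∀ l : List Nat, Sorted (f l) ∧ List.Perm (f l) l
```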
Okay, so this technology exists, and people have used it for mathematics as well, for proofs in mathematics, famously examples like the four color theorem, which I'll get to, but let me give an example. The system which is attracting a lot of interest, and I'll say more about it, is the Lean theorem proving system. Take the fundamental theorem of algebra: a non-constant polynomial has a root. This is the statement, in the Lean theorem proving language, of what I just said. The polynomial is F, we have as a hypothesis that the degree of F is positive, and if so, then there exists some complex z such that if I evaluate F at z, I get zero. So it's a pretty straightforward, logically precise claim. And here I've put up an informal, but easily made rigorous, proof; some calculations added to this would make a complete proof. The strategy is that you first show that the magnitude of F attains its minimum someplace, and then you essentially use holomorphy to show that that minimum really has to be zero; otherwise there would be a direction of decrease, and a contradiction.

And this is the Lean theorem proving expression of that first step of the proof, showing that the function attains its minimum someplace. I don't know how much of it people can read, so I'll go quickly. The second line of this proof is the statement, again in formal terms, of what I just said: there exists a y such that the absolute value of the polynomial at y is less than or equal to its value at any x. The rest of it is the proof, and it's not an easy thing to read. But it is a precise enough and clear enough proof that it can be verified: the computer can just say, 100%, this is true. The program that does the verification is sufficiently short that it has been checked in multiple ways. The proof is a combination of logical rules of combination, previously proven lemmas with names like exists_forall_le_of_compact, which you would have to look up elsewhere; if you were using the system you could hover your cursor over the name and it would explain what that thing meant, which at least helps somewhat. And strewn through it are various what are called tactics. When it says simp, the simplification of the goal statement it is trying to prove is sufficiently simple that the computer can figure it out itself: cancel like terms, do some sort of simplification. rw is an explicit rewrite: take the consequence of this previously proven lemma and rewrite the goal; if the lemma's consequence was A equals B and I see an A, rewrite it to a B. So it's somewhere in between computer programming and mathematical proof, more like computer programming, but it shares commonalities with both.

This is something one can learn to do with, at present, about a year of practice, and I will refer you to Kevin Buzzard's blog and to numerous demonstrations and exercises in the math library. But it's potentially a very powerful technology for any mathematician who wants rigorous results, and it would obviously save huge amounts of effort in refereeing and the rest if one could just write this stuff more simply. So where will that go? Well, one direction it could go, and this was a project of Tom Hales, who actually has led a sizable project.
This is the Tom Hales who proved the Kepler conjecture about the packing of balls in three dimensions. The idea would be that you can take the statements of theorems, like the second line here, which tend to be not that hard to write, and imagine some sort of library where, when you write a paper, just as you have MathSciNet and zbMATH and those systems of abstracts, you could have formal abstracts, where you explain a result of your paper in these computer-readable terms. So that's a possible direction to go.

Just as an illustration of where this is at: the group of maybe two or three hundred people who have learned how to do this have a Zulip channel that you can look at, and the library has more or less got somewhat past the standard French undergraduate math curriculum. It was outlined by Patrick Massot, which is why it's the French curriculum, but in any case it has quite a bit more; systematically, it has pretty much everything at that level. And there are some examples of formally verified proofs using this system. One which attracted quite a bit of attention, and was recently completed, having begun about two years ago, is called the Liquid Tensor Experiment. Peter Scholze, a famous mathematician, and Dustin Clausen have a program that they call condensed mathematics, which I would have a hard time explaining very clearly, but in some very high-level sense one is taking techniques from number theory, techniques that are used for example in p-adic analysis, and generalizing the statements to the point where you can use them in real analysis. Scholze had come up with a foundational lemma for this approach, which he proved, but his proof was sufficiently intricate that he wasn't satisfied; he wasn't even 100% convinced it was true. And so he challenged this community of people using this theorem proving system: you can verify theorems, so verify my proof. They took it up, and within something like four months they were able to verify the part of the proof that he was not convinced by, answering his original question; and last summer they finished it, so it's completely verified down to the initial definitions, which are in terms not of set theory but of the type theory that Lean is based on. Scholze not only was satisfied about his theorem, he said he learned things by reading this proof that people had written: to get it to work in this computational framework, Lean, things had to be rephrased in ways that he found illuminating.

Okay, so that's kind of the status of mathematical use of these things, and I refer you to Kevin Buzzard's ICM lecture from last summer for more. Could you now combine the two themes of this talk: could you use AI to make this theorem proving something relatively easy to use, or even, someday, could the computer start coming up with its own theorems and proving them? And what would that involve?
I showed you, though I didn't try to explain it, one of these proofs. It has the actual logical structure of the proof: each step involves either choosing a premise, some already proven statement from this proof, or some already proven statement from the library of millions of things that people have already proved, or a tactic selection: simplify the expression here, use that lemma you proved to rewrite that equation, and so forth. So you can regard completing the proof as something like a game of solitaire. And if you sit with the computer doing this, it really is kind of like a video game: you try out a tactic, and it lights up green if it works, and your list of things that you have to prove suddenly gets shorter, or not. So it's kind of like a game of solitaire, and what does this suggest? Well, obviously it suggests using the techniques which worked so well to play Go and chess and the rest, reinforcement learning; that seems like a long time ago now, 2016. This is an approach people tried quite a bit around 2018, when I started giving this talk: use the same reinforcement learning systems, trained on a database of exactly these theorems and libraries that mathematicians had already proven, and it learns how to play this game of solitaire. It can get reasonably good: you split long proofs into units of maybe five to ten lines, and it could prove maybe 75% of them. To be more precise, you take the big library of things people have proven and you hold some of them out, you train it on part of it, and then the test is how many of the held-out ones it can prove; it was 75% last I checked.

Okay, another project you could try, and again this was kind of a dream back in 2018, 2019, would be to translate the arXiv math papers into a logical framework such as Lean, which I showed you an example of. Let me skip ahead, because this stuff has accelerated quite a bit thanks to the famous large language models. I could give a whole talk on that, and I have; you can look, for example, at a talk I gave at the Institute about a month ago on this topic. But what does a large language model do?
You've all read quite a bit about it already, but it's a statistical language model. First of all, it predicts: given a sequence of words, it predicts a probability distribution for the next word. By iterating that process, these successive conditional probabilities, you effectively define a probability distribution over strings of words, over text. And you're just trying to model it: you have this big corpus of text, quote, all the documents on the internet, and you're trying to model its probability distribution. It's an old idea, but in 2017 the proposal of what's called attention made this really take off, with the transformer model. I won't go into the details in this talk, but the idea of attention is that generally we're taking words and representing them by embedding into some high-dimensional space, about 12,000 dimensions for GPT-3, and then we might operate on the words through some learned function, a neural network. But of course the meaning of a word depends very much on its context, the other words around it, and the transformer provides a precise way to combine the embeddings of the various words. Here is an example of a sentence where the interpretation of a word depends very much on what's around it: "the animal didn't cross the street because it was too tired" versus "the animal didn't cross the street because it was too wide". It doesn't suffice to know the grammar; to understand the reference of "it" here you have to know that an animal can be tired, a street can't be tired, and a street can be wide. This was GPT-2, I believe, already getting this level of interpretation, understanding the right antecedent of that pronoun "it".

So I'll say a little bit about how this went. People started making these language models, the transformer language models specifically, in 2017, 2018, and got good enough results to be convinced that this was better than previous approaches such as recurrent neural networks, but it was unclear how far it could go. A rather important development, in 2020, by Kaplan and collaborators, was the observation that language model performance satisfies simple scaling laws. It's kind of an old idea in machine learning, but they had the data to actually look at language models and their scaling. Here I've plotted graphs from their paper that have on the x-axis some measure of the size or the resources, the number of parameters of the model on a log scale, or the training data set size on a log scale, and the loss, basically the perplexity, on the y-axis. And you see these very nice linear log-log relations, so a power law, whose exponent, depending on what you're looking at, tends to be around a tenth. This very much encouraged people to think that we just have to scale the model up, make it bigger, and it will do better. And this was important because GPT-3, especially the bigger version, has 175 billion parameters, and training it took millions of dollars of compute. This is why in 2019 OpenAI went to Microsoft and got the billion-dollar investment and injection of cloud computing time to be able to pursue this research.

So here are some examples now of GPT-4. Many people in the audience, I'm sure, have played with this, and some of you may be working on these things. This is an example of writing a program, and there are lots and lots of examples of all the simple programs already on the web, so I tried to
choose one that probably wasn't previously there on the web, and gave it directly to GPT-4: write a Python program to compute the first four moments of the sequence of the first 100 prime numbers, using the Sieve of Eratosthenes. And you get a pretty good program. I actually didn't run it yet, but having looked at it, it has all the right ingredients; it's conceivable it has a mistake or two someplace, but there's the Sieve of Eratosthenes, there's taking the first 100 primes, there's the computation of the moments, so it's able to put all that together.

Then there is this auto-formalization that I talked about: can you take a statement in natural language and turn it into a logical statement, a logical proof? In particular, this is an example of an International Mathematical Olympiad problem, from Polu et al., 2022. It's a relatively simple problem for the IMO: prove the inequality at the top, which you can prove using algebraic manipulations. And this was the output of their model, a mostly standard pre-trained GPT-3, subsequently trained some more on formal math and the like, GPT-f. At that point it could solve two out of a set of 20 IMO problems; with more recent work it can solve something like 12 out of that set of 20, and I think GPT-4 can do that. This is from Lewkowycz et al., from the middle of last summer: the Minerva system developed at Google, starting with the Google PaLM language model and then training it intensively on the arXiv and other math and physics sources. You give it the question up there: a line parallel to y = 4x + 6 passes through the point (5, 10); what is the y-coordinate of the point where this line crosses the y-axis? And it produces a correct answer with the correct reasoning. For this kind of early-undergraduate math and physics problem set, it can solve 50% of the problems at this level, and maybe another 20% where it makes little mistakes. On the other hand, of course, if you've tried to push this, you know very well that at present these things are quite limited in their ability to reason, and it's really not totally clear what this means in some bigger scheme.

Okay, so let me stop there and try to summarize and draw some predictions for 2030. A prediction which an even longer version of this talk spends some time on is that there will be textbooks that really integrate the use of computation, in ways that help the students. It's not that there aren't many textbooks that use computation and tell you how to compute given knowledge of the subject, but something that brings the two together will be an obvious advance over the way we teach now. The field of computational mathematics will obviously continue to get better and easier to use. I predict that in less than five years this interactive theorem proving, the Lean system in particular, with the help of AI and perhaps other tools, will become relatively easy to use: comparable to computer algebra, Mathematica, Maple and so forth now, where you can just pick up the book and try it out on your problem, and not have to spend a year studying it before you can do anything. And at that point it will start to become a much more widespread tool.
I think on that same time scale there will be a translation of more or less the whole arXiv, not 100% reliable, not with all the proofs, but enough of it, into formal mathematics, to have that as a resource for this kind of research, in particular for mathematical search. It will be semantic mathematical search, where you don't just type the name of a theorem or the exact text of what you're looking for, but a definition in a much more conceptual way; or you could give it a set of equations or inequalities, and it would find something which we would say is directly relevant, even though the text doesn't match at all. The test of that will be that you type in your question and it tells you a result from some very different field of mathematics, one which is not obviously relevant but indeed turns out to be quite relevant. That's a test that has not yet been passed, but again I'm imagining it within five years at this point.

Okay, finally: will computers have invented or proven any major result by themselves? That is, I think, a clearer test than some vague question of human-level capability. Obviously that's the age-old question, and I made a slightly nuanced prediction here, also around 2020. It is to say that mathematical reasoning is comparable to reasoning in other human domains; it's not fundamentally easier or harder, it's a kind of reasoning we do throughout life, in many professions. And in fact math is clearly not the biggest priority, because you can make much more money reasoning about other domains, and you can do much more good for humanity, for example, by producing a medical system or a biomedical research system and so forth, and those may have more data. The great advantage of math, from many points of view but in particular from this one, is that you can make an arbitrarily long chain of reasoning, with as many steps as you want, and if they're rigorous then the result, the consequence of true premises, will be true. That's not the case in most areas, even of science: you eventually run into the limits of prediction, the limits of correspondence between your language and the real world, and so forth. So if the computer gains a reliable ability to make steps of reasoning, that's what's required to make arbitrarily long chains of reasoning. Obviously, if you only have 90% reliability per step, then already in 10 steps typically one of them will be wrong (0.9 to the tenth power is only about 0.35), so you need very, very high reliability. It exists, of course: the theorem proving systems, the SAT solvers and so forth that I described are reliable on that scale, but they're not general; you have to program and encode whatever you're talking about into those systems. But if the kind of technologies we're talking about here, the language models, could gain that type of reliability, then that would be the breakthrough that would make this possible. And the prediction I made, and that breakthrough has not happened yet, clearly, is that 10 years after it, and this may even be pessimistic, we will see computers performing at the very least at human level. That is by analogy with the game of Go, where the breakthrough that made it possible was Monte Carlo tree search, which was introduced and recognized to be quite an advance in 2006. So perhaps in the coming year, or five years, or whatever, the language model people will make this step, and then the prediction is that 10 years after that we will see human-level performance in mathematics. So let me stop there, and thanks for
your attention.

Thank you very much, Michael, it was really fascinating. Let me just add a piece of information that I think is even more important now: Michael is also part of the ICTP advisory board, and after this talk I think it's even clearer why your input is so important. So really, thank you. If there are any questions, please.

Yeah, I have two questions. The first is: there are certain claimed proofs in mathematics about which experts disagree, so I was wondering how far we are from the stage when AI will settle such questions one way or the other. And the second question is: to what extent will machine learning help us identify generic features of the string theory landscape? I'm asking this because you have thought a lot about this particular question.

Right, okay. Well, for the first one, of course, people could take a given point of dispute, formalize it in the way I talked about here, and let the computer settle it, without any AI. In practice, the mathematicians really do come to a consensus, so most mathematicians I talk to don't consider this a high-priority reason to develop the technology. For example, Mochizuki's claimed proof of the abc conjecture: most of the mathematical world does not believe that proof, for reasons that were pointed out by Scholze and others, and that tends to be where such disputes end up. But it's certainly true that if there were a dispute that went on for longer, the technology is already there. And, as I said, I believe within five years it will be much easier; right now it would be a big pain to put in real mathematics, although again, it's a project one can work on. Kevin Buzzard, I believe, I certainly noticed, has proposed working on Fermat, formalizing much of Wiles' proof: not the whole thing from the foundations, but from more or less the status of, say, 1990, maybe 1980, formalizing the subsequent developments. So it is a usable thing, but I did predict that within five years it will become very usable; the product comparison would be something like Mathematica.

And then, I think the string landscape is unfortunately a harder problem, because we don't have a very precise definition; we don't have a foundational definition of string theory, we have all sorts of tools that we bring in in unsystematic ways. To be honest, I think there are related problems that will be solved first. In fact, I predict that some sort of broad understanding of the bootstrap, of the space of conformal field theories, is a thing which could come within, if I have to state a number, five or ten years, maybe with computers helping us, because that's a much more well-posed problem. And that only bears indirectly on the string theory question.

Yes. Yeah, thank you. So I will ask a question myself. The first one is: I find this a very nice and optimistic picture, and what do you think would change if we put into the balance the fact that we're experiencing a scarcity of raw materials to produce chips, and that training these large models has a tremendous environmental and energy cost? How do you think things will go: will this exponential curve in the power of neural networks follow this trend if we take that into account?

Yeah, that's a good question. I mean, obviously an exponential cannot go on for very much longer; it always becomes a logistic curve, and we can't be that far from the inflection point. I also don't believe, I mean, there's
an extreme interpretation of these scaling curves that I put up, to the effect that you just keep scaling up and you eventually get some human-level or superhuman level of intelligence; most people don't believe that, and I don't believe that. On the other hand, there are many proposals out there about what's missing from the current language models and what you should put in. One reference I can point to, a very concrete thing, is Yann LeCun's proposal from a year ago; it's easily available online, and he has a list of maybe five ingredients which are missing that people could put in. So I don't think the advances necessarily depend on having exponential growth of the computing power; I think that phase will probably only go on for another couple of years. In fact, the thing which we're also clearly running out of is data: in terms of natural language on the internet, that has pretty much been used. And you need both things for the scaling, you need arbitrary model size and arbitrary data set size, so that's another reason why techniques and concepts need to advance as well. But they do seem to be advancing.

So do you think the trend is going towards smaller models, or?

Yeah, right now I don't know. People have certainly had success duplicating the performance of a big model with a smaller model, and we don't understand it well enough to make principled predictions. And of course that's a thing which many more people can work on, training a smaller model, that's one reason. If I had to guess, I would guess that you don't need — so GPT-4 is claimed to have 1.4 trillion parameters — I would guess you don't need that much to get that performance, and I would have thought most people would say that, but exactly how you make the model smaller is not understood.

Then I'll ask my second question. I'm curious to know how we can guarantee that languages like the ones you presented, like Lean, which give formal proofs, are actually reliable, especially in cases where the proof is not human-readable, or where you just get a yes or no answer. How can we trust these formal proofs?

Well, that starts to get into much more developed areas of computer science. I think the first step is to be absolutely convinced that the verifier is correct. There are many ways people do that: the verifier is short enough that you can verify it using another verifier, even, and there are examples of that being done; with enough cross-verification, and because the thing is not that long, you can become convinced about the verifier itself. And then, for a long run, you can of course repeat the run many times, and if each run has a small failure probability, repeating is an effective way to make that probability arbitrarily small. I think it becomes more interesting when the use of the formalism is in some way ambiguous or imprecise; then there's of course much more of a question of what it means, whether you can really be sure that the formalism reflects what you intended. That tends to be the big problem with verifying programs: making an accurate specification is just as complicated as writing the program. It's different, so you still gain something, but that's one reason that verification is hard; besides the skill required to
write the proof, there's the skill required to do the specification. But again, there's a lot of talk about how language models are going to change how we think about all these things, and in fact the language models may be a far greater motivation to make verification and theorem proving work than we ever had before. Because before, you could say: it's your job to make sure it's right; you stare at it long enough and run enough tests until you think it's right. You can't say that to the language model, but it's still so tempting to use it, so you would like some way to verify the correctness of what it's putting out. And at least for math, and to some extent for programming, one can do that. So I think, for that reason, verification will become a much more interesting topic.

Yeah, thanks for the great talk. You were emphasizing that the first step should be to verify the verifier. I'm wondering, not only in the next seven years but, let's say, in 20 years or so, do you think there will always be a need for a human to check after the verifier, or could we totally trust the language model for this?

Well, it's really a little hard to predict. It is starting to get into this unknown realm of, I mean, it may not be superintelligent in the sense that it's doing something no human could do, but superintelligent in the sense that it's doing what would take a human a year, and doing it in ten seconds. And then what do you say? I don't see the obstacle, in principle, to having a combination of language model and verifier, and perhaps related, but not magical, technologies we already have, that is so reliable that people would just trust it. I don't think there's any in-principle obstacle to that. When that will happen, I would certainly expect it within 20 years, and it could be five to ten.

My question is going a bit in the same direction, and it's coming more from my relatives and neighbors: do you think that we need to teach kids math and programming, or coding, over the next, let's say, 20 years, given this kind of development?

Right. You know, I do, but for the same reason that we have always taught all the kids math. Even if most people don't wind up using the math in their profession or their everyday life, it's still a great way to learn to think; it brings a different perspective on many aspects of life. There are many perspectives, literary, artistic, many others, and one should learn a variety, obviously, but math is definitely one of them. I think that should remain a foundation of education. Now, whether you should count on getting a high salary as a programmer in 10 years: probably there will be people who do, but many fewer than now.

Is there a last question? No? Then let's thank Michael again for this.