and let me welcome our speaker today, who is Max Tegmark. He's a professor of physics at MIT and a cosmologist by training. He has contributed significantly to our knowledge about the universe at large, developing methods to analyze the data from cosmic microwave background experiments and large-scale galaxy redshift surveys. Max is a deep thinker and curious not just about the cosmos, but also about consciousness, artificial intelligence and what it might mean for the future of our species. He's the author of two books, Our Mathematical Universe and Life 3.0: Being Human in the Age of Artificial Intelligence. And how the latter is connected to the former is going to be the topic of his talk today, which is called AI for Physics and Physics for AI. Max, please, whenever you're ready. Thank you so much. It's a real honor and pleasure to be here today and to see 130 people in the audience. It really makes my morning here in Massachusetts. I'm gonna start by sharing my screen. "Host has disabled participant screen sharing." So I'm gonna have to beg you to give me the power of screen share while you go, oh, there we go. Now it works. So Oxford has a very special place in my heart actually, because when I was a postdoc, the first time I got invited to go visit somewhere else, I was invited by George Efstathiou, who was at Oxford then, to come to Oxford. And it was all super exciting, like things always are when you do them for the first time. I had a wonderful time. I've had the pleasure of coming to your beautiful city many, many times. My PhD advisor, Joe Silk, later also came to the conclusion that Oxford was awesome and moved there from Berkeley and spent many, many years there. So I feel like I'm coming home today. I'm gonna be talking about AI for physics and physics for AI. And when I say AI for physics, I mean simply: what can AI do for us physicists? How can we take tools from AI and use them to do better physics? When I say physics for AI, I mean: how can physics give back? How can we use ideas and techniques from physics to actually improve AI and machine learning itself? And so a few years ago, I switched over my MIT research group to focus mainly on machine learning, because I love the interface between the two fields. And we have a lot of fun here at MIT working at the interface of AI and physics in my group. So if you are an undergraduate or a grad student, consider applying to grad school at MIT or a postdoc here, because it would be really cool if the person in the lower right corner here is you. And it's not just in my group that we have a lot of interest in the physics-AI interface. We just got funding from the National Science Foundation for this $20 million center on the AI-physics interface. It's called IAIFI; you can go to iaifi.org. And we have a lot of awesome faculty, not just here at MIT, but also at Harvard and our other sister universities here in the area. We're gonna be focusing very hard on this for at least the next five years. So this is, again, promotion for you guys who are looking for your next career step to consider coming here, because this is gonna be a really fun area to work on these things. I want you to feel very, very welcome. So why should we care about AI? Well, obviously because the power of AI has grown dramatically recently, opening so many opportunities. Just think about it. Not long ago, robots couldn't walk. Now they can do backflips. Not long ago, we didn't have self-driving cars, right?
And now we have self-flying rockets that can land themselves with artificial intelligence. Not long ago, AI could not do good face recognition. And now it can simulate a person's face saying things that they never said. Not long ago, AI couldn't save lives by doing medical diagnosis. Now it can do diagnosis of prostate cancer, lung cancer and many eye diseases as well as a human doctor can. And I think the greatest potential to save lives through machine learning and medicine is to actually use it to accelerate medical science itself. For example, one of the oldest unsolved problems in biology is the protein folding problem, right? Where you're given the amino acid sequence from the genetic code, and your task is to figure out the shape in which this will fold up to make a protein. And the annual contest for doing this was recently won, surprisingly, by an AI from Google DeepMind, just down the road from you in London. Here you can see it figuring out the shape that this particular one folds into. And here you can see how other 3D predictions, in blue, agree quite well with the very costly and slow x-ray crystallography measurements. So this is an area where there's a lot of work left to be done, but it's pretty obvious already, I think, that this will eventually help us a lot with drug development and drug discovery, for instance. If you've played the board game of Go, or paid attention to it, then you remember very vividly when Google DeepMind showed that AI could beat us at this game, with their AlphaZero AI. It took 3,000 years of human Go games and Go wisdom and took all that big data and threw it in the garbage can and just played against itself for 24 hours and became the best Go player in the world. And the exact same machine learning algorithm also became the world's best chess player by playing against itself for four hours. And the really interesting thing here, the most interesting thing to me, is not that DeepMind's AlphaZero crushed human gamers, but that it crushed human AI researchers, people like myself and others who've spent decades just trying to handcraft algorithms for this, all made obsolete. So if AI is getting powerful like this, surely it can help with physics in a lot of ways too. And yes, it absolutely can. In fact, I know a number of you in Oxford are doing machine learning in aspects of your physics. Here in my MIT physics department, a significant fraction of the professors are using machine learning for something. I actually predict that 10 years from now, physicists who don't do anything ever with machine learning are gonna be about as common as physicists who never use calculus in their work. It's just gonna become a tool that becomes ever more ubiquitous because it's so generally applicable. As a few examples of what we do with it here at MIT: we use machine learning to better detect the gravitational waves coming from colliding black holes halfway across the universe. We use machine learning to better detect extrasolar planets in other solar systems. We use machine learning to better figure out what's going on in particle collisions at the Large Hadron Collider. We use machine learning to better do neuroscience and figure out what's happening in brains.
And we're even, particularly with Marin Soljačić's group, making better hardware, based on physics, that can give back to the machine learning community: special optical chips which may be a million times more energy efficient than the traditional ones for doing machine learning. So better hardware is one way in which we physicists can help the machine learning community as a way of saying thank you. But there is so much we can do also on the algorithm and software side, which is what I wanna spend the rest of this colloquium on. Why does the machine learning community need any help? Well, for one, of course, they are always grateful if we can give them better algorithms that perform better, learn faster, and so on. But another challenge, which I think is important, is that machine learning is often used as a black box that we have very little clue how it works. And that can cause challenges. For example, we had this very widely sold software package in American courtrooms that made machine learning recommendations for who should get probation and who should stay in jail, until it turned out that people had understood how it worked so poorly that they didn't realize it was severely racially biased. If you're a company, you do not wanna be the next Boeing, who installs an automated system that's so poorly understood, despite its simplicity, that two of your aircraft crash and kill hundreds of people. You do not want your company to be the next Knight Capital, which goes ahead and deploys an automated trading system that was so poorly understood that it lost $10 million a minute and kept doing that for 44 minutes until someone must have said, whoa, stop, shut this damn thing off. You don't wanna be the next Yahoo, who gets every single email account hacked, or the next Equifax, who gets all your credit card information hacked. And even though in these hacking examples, people's knee-jerk reaction is to just blame some sort of evil hackers, the embarrassing truth is that hacking is only possible because you haven't understood your system well enough, again, to know its vulnerabilities. So what has that got to do with us? Well, I think physics can help a lot in demystifying the machine learning black box, and that's what I wanna talk about now. So as a warm-up example, let's take something that DeepMind did about six years ago. They trained a simple machine learning algorithm, DQN, to play this Atari game and others. And you can see in the beginning, it plays terribly. It's given as input just the numbers that represent the colors of the pixels, and then it's just told to maximize the score and has no clue what a game is or space or balls or anything. But after a while, it looks to us pretty intelligent. It figures out this really cool strategy, et cetera. So this is a little bit of artificial intelligence here. But how does it actually work? Do you have any reason to trust that it will always work reliably? The way you would if it was your son or daughter who had figured out this strategy. Well, let's take a peek inside. We just ran this in my lab and so we can see what's inside. And what's inside is: you take the pixel colors, you take those numbers and you multiply them by a gigantic matrix, and then you apply a nonlinear function to it, and then you multiply it by another matrix, et cetera, et cetera. And these matrices are defined by about 900,000 numbers, some of which are shown here.
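To make those gigantic matrices a bit more concrete, here is a minimal NumPy sketch of that kind of forward pass. The layer sizes and random weights are made up for illustration; this is not DeepMind's actual DQN architecture, which uses convolutional layers, but the flavor is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes purely for illustration (a real DQN is convolutional).
n_pixels, n_hidden, n_actions = 84 * 84, 128, 4

W1 = rng.normal(size=(n_hidden, n_pixels))   # one gigantic matrix of learned numbers
W2 = rng.normal(size=(n_actions, n_hidden))  # another matrix of learned numbers

def policy(pixels):
    """Map raw pixel values to one score per possible joystick action."""
    hidden = np.maximum(0.0, W1 @ pixels)    # multiply by a matrix, apply a nonlinearity
    return W2 @ hidden                       # multiply by another matrix

frame = rng.uniform(size=n_pixels)           # stand-in for one video frame
print(policy(frame))                         # four numbers, one per action
print(W1.size + W2.size, "parameters")       # roughly 900,000 inscrutable numbers
```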
Is it now crystal clear to you how this works or whether you should trust it to always work? Of course not. This is completely useless. It's a powerful but utterly inscrutable black box, right? And it's fine if it's inscrutable if it's just a game, but if you put machine learning in charge of people's lives, by telling the judge who should stay in jail or by flying an airplane or something else, you really wanna have a higher bar of understanding it. And this is what we're focusing on in our group. Our slogan is intelligible intelligence. The vast majority of the recent progress in AI has been exactly in machine learning, right? Often powered by neural networks that you train in the form of these inscrutable black boxes where you have very little clue as to how it actually works. And our focus is to do exactly this as step one, but then not to stop and publish or ship, but to see if we can develop additional automated techniques that will demystify the black box and find an equally powerful algorithm or function that is just easier to understand. The ultimate simplicity would be if we can get some analytic formulas that we can then study with our physics methods to see how this thing is gonna behave. And I'm gonna discuss three examples from recent papers of this type that we've done in our group. One is where what you try to auto-discover are actually equations that approximate the function that the neural network implemented. The second is where we try to auto-discover conserved quantities. And the third one is where we try to auto-discover which degrees of freedom are even the most interesting ones to pay attention to in the system. So let's start with auto-discovering equations. This is known as symbolic regression. Symbolic regression, sorry, this is a paper we just got accepted to the NeurIPS conference. Symbolic regression is simply finding a function in symbolic form that approximates a data table. So our task is to predict the last column of the table from the previous ones. If the function you're looking for is a linear function, then this is just linear regression. And this is something that I'm quite sure the vast majority of you have done many times. It's as easy as solving a system of linear equations. It's super useful. But if it's a more complicated, arbitrary nonlinear function, then this is obviously a very hard problem. We know that Johannes Kepler spent four years trying to do symbolic regression on data on Mars's orbit until he finally went, aha, it's an ellipse, revolutionizing physics. How do we automate things like this? And if it's a function of many variables, we also have a hard time plotting it. This is generally believed to be NP-hard. And the reason why is pretty obvious. If you take your symbolic expression and you just decide you're gonna try the simplest one and then the next simplest ones and you keep trying, trying, trying, there are exponentially many symbolic expressions of length N, just like the number of passwords that you could pick of length N grows exponentially with N. So what you find is that even when you get up to sort of medium-complicated things like Planck's black-body formula, it can take more than the age of our universe to just do some sort of brute-force search here. So a common thing in machine learning is we've seen that there are many cases where you can prove to yourself that the generic problem is hopelessly hard, but that the problems we actually end up caring about are a lot easier.
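Just to make that exponential blow-up concrete, here is a toy brute-force symbolic regression, not the actual AI Feynman search, that enumerates expression strings over a tiny made-up alphabet until one fits the data; even on a toy problem like this the number of candidates explodes quickly with expression size.

```python
import itertools
import numpy as np

# Toy brute-force symbolic regression: try expression strings in order of
# depth until one fits the data table. Purely illustrative; bigger alphabets
# and longer formulas blow up exponentially, like guessing passwords.
x = np.linspace(1.0, 2.0, 20)
y_target = x ** 2 / 2            # pretend we do not know this formula

leaves = ["x", "1", "2"]
ops = ["+", "-", "*", "/"]

def expressions(depth):
    """Yield every expression string with at most `depth` levels of operators."""
    if depth == 0:
        yield from leaves
        return
    yield from expressions(depth - 1)
    for a, op, b in itertools.product(expressions(depth - 1), ops, expressions(depth - 1)):
        yield f"({a}{op}{b})"

tried = 0
for expr in expressions(2):
    tried += 1
    try:
        with np.errstate(all="ignore"):
            y = eval(expr, {"x": x})  # acceptable only for this tiny hand-picked alphabet
    except ZeroDivisionError:
        continue
    if np.allclose(y, y_target):
        print(f"found {expr} after trying {tried} candidates")
        break
```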
The traveling salesman problem is like that: the worst case is hopelessly hard, but the instances we actually meet tend to be much easier. And I'm gonna try to persuade you that symbolic regression is also like that. How are the kinds of equations we like to discover in science easier? Well, they're very often modular. So graph modularity, returning to the title of our paper here to unpack it, refers to the fact that if you take the computational graph that defines the formula and you write it out like this as a tree, then it's modular. In this case, I'm showing you a function of three variables, which can be decomposed as two functions of two variables. And for some reason that we don't understand, it so happens that the vast majority of formulas that we actually care about in physics and chemistry and other areas of science tend to have a lot of modularity in them. So we shamelessly exploit that in this paper. What we do is we first train a neural network to approximate this mystery function. We don't know what the function is, right? We're just given a data table. We train a neural net to pretty accurately approximate F as a function of the input variables. And then we prove that if the thing is modular, there will be certain properties that the gradients of this neural network have that you can identify (I'll show a small sketch of this kind of gradient test in a moment). And we prove that if you do this recursively, you will actually discover the entire modular structure, no matter how complicated it is. So what we actually do then is we start with the mystery function of a lot of variables. We keep recursively simplifying it into smaller and smaller modules until they become so small and simple that we can figure out what they are by brute force, polynomial fitting, or some other technique. And then we put the pieces together again and get out the whole function. So you can think of this as a recursive divide-and-conquer strategy of taking a hard problem and breaking it into successively easier ones. Bearing in mind that, you know, whenever you can eliminate one variable, the curse of dimensionality gets better. You're fighting against the exponential. So let me give you just one concrete example of this. Suppose I have a data table showing how the force, the gravitational force, depends on the nine variables you see in the upper left corner. But you don't know the formula because Newton hasn't happened yet. Okay, the goal is to have machine learning do this. First, our code will automatically do dimensional analysis and simplify this down to just six dimensionless variables, which I'm calling A through F here. And then it trains a neural network to fit this function. And it discovers, wait a minute, there's a modularity here: the only way in which it depends on the variables C and D is through a certain combination of them, which it can then discover is a minus, a subtraction. In other words, in physics speak, it's just automatically discovered translational symmetry for the C and D variables. So now it takes those columns for C and D and replaces them by the one column which is the difference of them. So now it has a new problem with one variable fewer. Go back again, recursively train a neural network, and now it's gonna discover another translational symmetry, another graph modularity, and replace E and F by, in this case, their difference. So now it's down to just four variables. So now it has a table with five columns, four inputs, one output. Again, it trains a neural network to approximate this.
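Here is a minimal sketch of the kind of gradient test mentioned above for spotting translational symmetry; a hand-written toy function stands in for the trained neural network, and the sampling ranges and tolerance are made up for illustration, so this is not the actual AI Feynman code.

```python
import numpy as np

# Toy stand-in for the trained neural network: in the real pipeline the
# derivatives below would be gradients of the network fitted to the data
# table; here we use a known function just so the sketch runs end to end.
def f(a, b, c, d):
    return a * b / (c - d) ** 2        # depends on c and d only through c - d

def partial(f, args, i, eps=1e-5):
    """Central-difference estimate of the partial derivative of f with respect to args[i]."""
    up, down = list(args), list(args)
    up[i] += eps
    down[i] -= eps
    return (f(*up) - f(*down)) / (2 * eps)

rng = np.random.default_rng(0)
# Sample c and d from separated ranges so (c - d) stays safely away from zero.
points = rng.uniform([1.0, 1.0, 1.0, 3.0], [2.0, 2.0, 2.0, 4.0], size=(100, 4))

# Translational-symmetry test: if f depends on (c, d) only through c - d,
# then df/dc = -df/dd at every point, so their ratio is -1 everywhere.
ratios = [partial(f, p, 2) / partial(f, p, 3) for p in points]
if np.allclose(ratios, -1.0, atol=1e-3):
    print("c and d only enter through c - d: replace those two columns by their difference")
```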
Now it discovers another kind of modularity: it discovers that this function of four variables is actually just a function of one variable times a function of three variables. In this case it's what we call multiplicative separability, but this works for any kind of modularity. And at this point, the pieces are simple enough that our other techniques are able to find what they are. We put it all together and we get the solution. We had an earlier paper, which we published in Science Advances, where we were only able to do this for certain kinds of modularity, like simple symmetry and separability. But in this new work, we show that this can be done for any modularity whatsoever, for any functions of any number of variables. So if we look at this title, Pareto-optimal symbolic regression exploiting graph modularity, we've unpacked what symbolic regression is and what graph modularity is. So what's the deal with this Pareto-optimal thing? Well, what we found very useful was to not just pay attention to how bad the approximation of the fit was, in other words the inaccuracy or loss, but to also pay attention to its complexity, inspired by Occam's razor, which says that if you have two things that are equally accurate, pick the simpler one. So if you think of all possible approximations you can make to the data, you can put them in this plane. We kill off everything except what lies on the so-called Pareto frontier: we only keep, for any accuracy level, the simplest one. And what you see here is what our code automatically discovered in this case. As scientists, in the spirit of Occam's razor, we're particularly interested in convex corners: formulas that are both unusually accurate and unusually simple. And there are exactly two of them here. First, we have in the lower right corner what turned out to be the exact answer. We gave it kinetic energy data and it automatically discovered Einstein's formula for relativistic kinetic energy. But you see it also discovered another formula, mv squared over two, all by itself, and realized that this one is a very cool one, because even though it's less accurate than Einstein's, it's much, much simpler. And so if you have some data where you don't know what the formula should be, this can give you a suite of useful approximations. We based this plane here and this Pareto frontier on complexity theory. You can see that the x-axis and the y-axis are both labeled in bits of information. The x-axis is how many bits you need to describe your model, your formula, and the y-axis is how many bits of information you need to describe all the prediction errors that you make. How do we do that? Well, we did it as an easy-to-compute compromise between, on one hand, Occam's razor, which is just too vague to compute mathematically, and on the other hand, Ray Solomonoff's algorithmic complexity theory, which is just too hard to compute in practice. What we did here, as you can see in the table, is we just came up with a relatively easy-to-compute scheme for an approximate complexity measure, in bits, of natural numbers, integers, real numbers, and so on, and even of functions themselves in symbolic expressions. And when we put all this into our algorithm, this is what happened. Let's look at our results. So to test this, we generated, first of all, four terabytes of tables of numbers by taking the 100 most famous or complicated equations from the Feynman Lectures, some examples shown here.
Then we also took 20 more difficult equations from various graduate textbooks that I had lying around the office. And then we took 12 quite hard equations to see how good our graph modularity discovery would be. And then we also took a bunch of probability distributions where we didn't even have evaluations of the function, where we just had a bunch of points in n-dimensional space. And to convert them into the form that our AI Feynman software could tackle, we started by doing a so-called normalizing flow, you can ask me about that afterwards if you haven't heard about them, which gave an approximation to the actual function that we then applied our symbolic regression to. And this is what happened. The state of the art in this field for more than the past decade has been the Schmidt and Lipson genetic algorithm. And you can see that we were able to do substantially better than that and got the new state of the art here. We could solve 100% of those Feynman problems, 90% of those harder graduate ones, 100% of those modular ones, and 80% of the probability distributions, largely because of the modularity and other techniques I mentioned. Another thing that was really cool was that this information theory thing, doing it all with bits, really increased the robustness of our method a lot. So whenever we were able to solve and find the exact equation, we then repeated it by adding ever more noise until we broke it, until our algorithm failed. And we found that we typically could increase our noise robustness, by adding this Pareto frontier stuff, by two to three orders of magnitude, which is nice because of course a lot of real experimental data has noise, and a lot of numerical simulation data has noise also. My three-word summary of the symbolic regression, for those of you who wanna use it on your own data, is pip install aifeynman, because that's all you have to type on your computer to install this stuff. All right, so let's come back to the bigger picture. We're talking about making an assault on the black box, on a neural network, for example, that can do something, but where you have very little idea how. We've talked about how you can find Pareto-optimal symbolic approximations to it. Let's go on now to talk about finding the useful degrees of freedom. So what do I mean by that? And why am I calling it pregression? Well, in this paper we talked about how we sort of cheated in the earlier thing I said here. Here's an example where we have a quartic oscillator moving around, and if we just measure its X and Y coordinates and plug them in and compare them with other things, we can discover an algebraic equation of motion like this in terms of the derivatives, but we were the ones who had to tell it what we were gonna put in, that we were gonna put in X and Y coordinates. That's not what we actually do in physics, right? If you're a Galileo and you're looking at something, you actually just get an image on your retina, and a big part of the brilliance of Galileo and other physicists was to figure out which aspects of all these thousands of numbers that represent the pixel brightnesses are the ones that are actually relevant for predicting the future. In this case, what we did was we made these videos of a rocket flying around on some silly background. For example, this one is moving in a magnetic field, so it goes in a circle.
We wanted to know, okay, can we just feed in the raw video and have the machine learning thing not only predict the next video frame, predict the future, but also figure out that it's completely irrelevant to pay attention to the color of the sun or the shape of the palm trees or the fact that the rocket is red, that all that really matters are X and Y of the rocket. It's a highly nonlinear function of the roughly 4,000 pixel numbers to get X and Y out, right? Could it discover that automatically is what we wanted to know. And you might think this is kind of easy. It doesn't feel so hard to pull out a ruler and measure the X and Y coordinates, but it's actually very hard, because the way I've stated the problem, it's so general that it should work equally well if I put this crazy distorting lens in front of it. We can think of any video frame here as just a vector in a very high-dimensional space whose coordinates correspond to the three colors of all of the pixels, right? So in the high-dimensional space, this point is just moving around on some curve. And we're trying to figure out how we can map this into a low-dimensional latent space. We as physicists know that it should be enough to go into a two-dimensional latent space to capture this, so that the laws of physics get simple. Now, measuring the undistorted X coordinate is actually even more complicated because you have to figure out what this lens is doing and so on. So the good news is we managed to get this to work, but it took a whole year, and I wanna share a little bit about why we failed so much and how physics came to the rescue. So what we did was we created a neural network with this architecture here. And by the way, I'm calling this all pregression because it's the pre-processing you have to do before you can even do symbolic regression: figuring out what degrees of freedom you should pay attention to and try to do symbolic regression on. So here's what we did. We trained a neural network to map the high-dimensional image space into a low-dimensional latent space, two dimensions or five dimensions or whatever. And then we trained another neural network to map back to the image space again. Those of you with machine learning backgrounds will recognize this as an autoencoder. And then we trained another network to just try to take the previous two points in the latent space and get the next. Why two? Well, because the laws of physics are second order, but you can easily generalize. The problem with this is, as we describe in the paper in detail, that just like in general relativity, you have this annoying reparameterization invariance, because if you found a solution to this, then you can easily build much more complicated solutions that are equally accurate by just doing some sort of invertible mapping of the latent space. We didn't want that. We wanted to find this natural latent space that physicists find, like an inertial frame. So we wanted to find the encoding into the latent space that gave us the simplest laws of motion. William of Ockham back again. How do we define simple here? So I started thinking about this guy. Einstein told us that a space is simple if its curvature is small, right? The smaller the better. So first we thought maybe we can just minimize the induced curvature of this mapping, where the Jacobian defines the tetrads, et cetera, et cetera. But then you still have to take a lot of derivatives of your neural network and it runs a bit slow. So we decided to try a crude approximation to this first.
The induced metric here is just the Jacobian times its transpose, the Jacobian of the map u, right? And if u is linear, then the Jacobian is just constant, so the g in the lower left corner is constant. And that means those Christoffel symbols will all be zero, and the Riemann tensor is all zero, and the curvature is all zero. So we thought maybe we can just try something simpler first and see if it works, by just penalizing nonlinearity in the Jacobian, penalizing derivatives of the Jacobian, adding that to the loss function. And that actually worked really well. It was automatically able to take all these different kinds of rocket videos and different laws of physics and always map them successfully into the kind of latent space that a physicist would choose, corresponding to some sort of X and Y coordinates. And then we fed this into AI Feynman and it discovered exactly the laws of motion that we had put in. But it took us a very long time, because in the beginning it just failed epically with this. What kept happening was, even when we had very simple motion, like the rocket going in a straight line with uniform speed, the latent space would look all sort of tangled up, like this picture with a Cheshire cat here in the middle. And then we eventually realized what's going on here. In this very high dimensional, many, many thousand dimensional space where the trajectories are actually happening, right, there's this two-dimensional hypersurface that the rockets can be on, because there are really only two degrees of freedom in these videos. So I was thinking, it's like you have a towel in your hand crumpled up in some weird shape in 3D space. If you just randomly plop it down on the floor, saying I wanna map it into two dimensions, the two-dimensional latent space, the floor, it's gonna have a lot of folds in it, the towel, right? And then in order to make it simple, you would have to undo those folds somehow to straighten things out. But the loss function we have doesn't like that, because whenever you take two trajectories and cross them, then that means that there are two images of rockets in very different places that get mapped onto the same point in latent space, making it impossible to invert and figure out what the picture is. So effectively, it's like all these trajectories repel each other with a Coulomb repulsion, in the picture in the middle bottom. So we thought, well, there is this famous theorem in knot theory that you can't tie knots in four dimensions. And so we thought maybe we should go to a higher-dimensional latent space. And the knot theorists have worked out that if you're embedding a two-dimensional towel or whatever, then the equivalent thing is that you need to go to five dimensions to be able to automatically untangle everything. For any number of dimensions, the knot theorists will give you a formula. So we found that if you just start with a five-dimensional latent space instead of a two-dimensional one, then it actually finds a really nice embedding. And when it's finished finding the simplest one, we can do a principal component analysis of this and find that it's actually all lying in a hyperplane in 5D. And then the principal component analysis tells us how we can just map that back into a 2D latent space. So this is how we actually found all these: by adding a few extra dimensions to solve these topological problems, to get through the energy barriers.
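Here is a minimal sketch of that final principal component analysis step, with synthetic latent points standing in for the ones the trained encoder would actually produce; the sizes, noise level and threshold are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for what the trained encoder would give us: latent points that
# live in a 5-dimensional latent space but actually lie, up to noise, in a
# 2D hyperplane, because the videos only have two real degrees of freedom.
n_points = 500
plane = rng.normal(size=(2, 5))                  # a random 2D plane inside 5D
true_xy = rng.normal(size=(n_points, 2))         # the rocket's "true" x and y
latent_5d = true_xy @ plane + 0.01 * rng.normal(size=(n_points, 5))

# Principal component analysis of the latent points.
centered = latent_5d - latent_5d.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
variance = singular_values ** 2 / n_points
print("variance along each principal axis:", np.round(variance, 4))

# Only two axes carry appreciable variance, so project back down to 2D.
effective_dim = int(np.sum(variance > 1e-2))
latent_2d = centered @ components[:effective_dim].T
print("effective latent dimensionality:", effective_dim)
```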
And at the end, we were able to get, as I said, all of the equations of motion out. And we were even able to, which you can ask me afterwards about if you'd like, rediscover inertial frames, and find that of all the degenerate linear ways of parameterizing the latent space, the one that minimized the complexity was a good old inertial frame, which is isotropic and so forth. But I don't wanna talk more about that now, because I wanna spend the last minutes of my talk here on the third application, conserved quantities. So if you are just given a bunch of data on, let me back up a little bit. This paper is with my grad student, Ziming Liu. If you start with a dynamical system like this and you're just given data from it, like in this case of a double pendulum, I give you the two angles at a bunch of different times, but I don't tell you what the laws of motion are, then you can use the AI Feynman techniques that we discussed to try to figure out what the laws of motion are, or what the Hamiltonian is. And then you can also use, if I just give you this in video form, the pregression things, to try to figure out that it's those two angles you should have paid attention to rather than the colors of the dots or whatever. But as physicists, the Hamiltonian or the equations of motion aren't the only things that are useful for us to know, right? It's also very interesting to know if there are any conservation laws and any conserved quantities in there. This is a famous physics problem that Poincaré did some fantastic work on a century ago, right? How can machine learning just automatically discover this? Poincaré and all those after him have tended to discover what's conserved and what's not by knowing what the equations of motion were in the first place, right? But we want to take a data-driven approach: just give it a bunch of data from a physical system or a neuroscience system or whatever, and see, are there any conserved quantities there? So the way we go about this is we start by thinking about the phase space. In this case, there are two angles, theta one and theta two, to define the configuration space, so the phase space has four dimensions. We can think of the point moving around in this four-dimensional phase space. And if you let it go for a really, really long time, the trajectory will form a dense set in some submanifold of the four-dimensional space, right? And if there are no conservation laws at all, then the ergodic theorem says that it's gonna fill up all of phase space. In all of these cases, we have energy conservation, so we lose one dimension from there. And for every additional conserved quantity, you lose one more dimension in that manifold that things are allowed to move on. So what we actually wanna do with machine learning is just give it this trajectory data and have it figure out the dimensionality of the manifold that the thing is moving on. And then if we know that dimensionality, which I'm calling D here, we just take the number of phase-space dimensions, four in this case, subtract off the dimensionality of the manifold, and get the number of conserved quantities. This is interesting and challenging, as you can see from this picture, because it actually even depends on the initial conditions. It's well known for the double pendulum that the only thing that's conserved in general is energy. So both for the chaotic and the other non-chaotic, nonlinear case, you actually move around on a three-dimensional manifold and there's only one conserved quantity.
But if you start with very small oscillations, like in the upper left, then you can linearize the system and it decouples into these two separate normal modes. And then we know, from just doing the math analytically, that the energy is separately conserved for both normal modes. So now you actually have two conserved quantities and you're on a two-dimensional manifold. In the lower left, you see a periodic orbit. A periodic orbit means that the curve that you're tracing through phase space actually ends up closing back on itself. So the whole manifold in phase space is just a one-dimensional curve, D equals one, so there are three conserved quantities. How can you discover all this automatically? How can you discover the dimensionality of the manifold that you're moving around on? Well, we wanted to test this on five different systems that I'll come back and talk about here in a bit. And we first train a neural network to try to learn something about the manifold: we take each data point and we add some random noise to it, and then we train a neural network to push this point back onto the manifold, with a loss function which encourages this. So the neural network now has some intuition as to the shape of the manifold, but we have no clue how it works. It's just an inscrutable black box again. But now what we do is we use this to try to extract the dimensionality of the system. And we do this by plotting what we call the explained ratio diagram here. So what we do is we basically just take a bunch of points on the manifold that we generate and do a principal component analysis of them. If everything were perfect, then if you look at a small part of the surface, the surface will be planar there. So if it's two-dimensional, you get two eigenvalues that are big and the other eigenvalues will all be small, and now you've learned that it's a two-dimensional manifold. If you go to very small scales, then the noise is gonna dominate and you're just gonna see something fairly isotropic, so all the eigenvalues will be about the same. And if you go to very large scales, it's not flat anymore, so you in general see a shape that fills up the same dimensionality as the whole space. So in other words, what we do is we think of the dimensionality of the manifold as a renormalized quantity which is a function of the length scale: how far the noise moves your points around. And what we always find is that in the limits of very large length scale and very small length scale, the manifold always looks like it has the full dimensionality of the phase space. But in between there's a magic region where you can see how many dimensions it actually has. So let's look at how that plays out in these examples. For the simple harmonic oscillator on the left, it discovers that there's one eigenvalue that becomes tiny. So there is one conserved quantity, the energy. For the Kepler problem in 2D, we see that there are three eigenvalues that become tiny, so we've learned that there must be three conserved quantities. We knew, of course, from way back when that these are the energy, the angular momentum and the Runge-Lenz vector, in other words the angle of the major axis. But it just auto-discovered this by itself. For the double pendulum, it finds that generally there's only one small eigenvalue, the red curve there. That's the energy that's conserved. And for the magnetic mirror, it again finds that only energy is conserved.
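Here is a minimal sketch of that scale-dependent dimensionality estimate for the simplest case, a harmonic oscillator orbit in a two-dimensional phase space; the neural-network denoising step is skipped, and the sampling, noise level and threshold are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Phase-space trajectory of a simple harmonic oscillator: the points (q, p)
# lie on a circle of constant energy, i.e. a 1D curve in the 2D phase space.
t = rng.uniform(0.0, 1000.0, size=5000)
points = np.column_stack([np.cos(t), np.sin(t)])
points += 0.02 * rng.normal(size=points.shape)      # a little measurement noise

def apparent_dimension(points, scale, threshold=0.1):
    """Estimate the manifold dimension from a PCA of one neighborhood of a given size."""
    neighborhood = points[np.linalg.norm(points - points[0], axis=1) < scale]
    neighborhood = neighborhood - neighborhood.mean(axis=0)
    eigvals = np.linalg.svd(neighborhood, compute_uv=False) ** 2
    explained = eigvals / eigvals.sum()
    return int(np.sum(explained > threshold))        # how many "big" eigenvalues

for scale in [0.02, 0.3, 3.0]:
    print(f"length scale {scale:4.2f}: looks {apparent_dimension(points, scale)}-dimensional")

# In the intermediate regime the orbit looks 1-dimensional, so in this
# 2-dimensional phase space there is 2 - 1 = 1 conserved quantity: the energy.
```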
We also did the three-body problem, where it first discovers four conserved quantities at the linear level, which are the total momentum and, if that's zero, the x and y coordinates of the center of mass, and then it discovers two more. And then once we discover these, we also throw this into AI Feynman, and we can actually discover, for many of these, the symbolic expression for a conserved quantity, which we thought was quite fun. And let me end by talking about how you can use this quite blindly to mess with systems and also discover phase transitions and other interesting things about them. So for example, let me take this double pendulum again. Here, we already talked briefly about how the number of conserved quantities depends on the initial conditions. For small oscillations, it's two. For the periodic case, sorry, it's three conserved quantities. And in general, it's only one. Let's look at the Kepler problem. In the middle, we have the classic Kepler problem, inverse square law, three conserved quantities. But then we decided to mess with this. We changed the two, the exponent of the inverse square law, to two plus epsilon. Now you start to get some precession going. So the longer you wait, the more it's gonna precess. And by the time it's precessed a large amount, like a radian or something like that, this Runge-Lenz vector, the direction of the semi-major axis, that conservation law gets sort of flushed down the toilet. So we could plot, in this plane here of epsilon versus how much time we waited or how many orbits it went through, a phase diagram of how many degrees of freedom there were. And we see from this that it auto-discovers that what matters is epsilon times time. We know as physicists that that's because, in this case, you get a certain precession angle per orbit, but it discovers it all by itself. Another example that we had fun looking at is the magnetic mirror, where, again, the magnetic mirror looks like what's in the upper left in terms of physics. And here, there are different regimes. When we start messing with this and changing this potential, the AI Poincaré code we have automatically discovers that, depending on how you set this up, you actually get different numbers of conserved quantities. So it automatically discovered, for example, the periodic orbit here in the V equals 0.95 case, where you can see that there are more conserved quantities all of a sudden. For the three-body problem, finally, it also auto-discovers all sorts of interesting things. In general, there's only the total energy that's conserved. If you set up something like the solar system, where the earth and moon, blue and red or whatever, are doing their thing and the sun, in green, is quite far away, you still have a lot of energy exchange between the earth-moon system and the sun. The plot on the right shows the energy of one versus the energy of the other. So there's only one conserved quantity still, which is the total energy. But if you make the distance to the sun huge compared to the earth-moon system, then the time scale of energy exchange gets very, very long. And we find this phase diagram where you can see that for a limited period of time you have this extra conserved quantity, where the energy of the earth-moon system is separately conserved. We also just automatically discovered cute periodic orbits for the three-body problem case. So I wanted to summarize, to give you time for questions.
I've talked mainly about how I think we physicists can add a lot of value to machine learning. The aspect I focused on today is how we can take ideas from physics and help make machine learning more interpretable without losing any of its power. We've talked about how you can train a black-box neural network to do something cool, and then gradually, in some cases, demystify it a lot and find something as accurate or more accurate, but just much simpler to understand. And I want to end by saying that we've developed some tools that we are very eager to apply to various kinds of data. Every single equation I showed here, we just rediscovered, right? It's gonna be a really cool milestone to use tools like this to discover an equation of physics for the first time. We could have done that if we had developed these algorithms and run them 100 years ago, right? Before some of the formulas were discovered. But we would love to do it together with you. So if you have any fun data sitting around on your hard drive, either that came from an experiment or from a numerical supercomputer simulation or whatever that you did, and you think there may be some patterns in there that haven't yet been discovered, let me know. We'd really love to collaborate on this. So thank you so much. Great, wonderful. Thank you so much, Max. That was fantastic. So on behalf of the audience, let me give you some virtual applause here. And we have time for questions now. So we can do it both ways. People who want to ask a question can either raise their hand, or they can also put their questions into the chat window and I will then read them out for them. And I see that there are already two questions in the chat window. So let me go through those first. The first question is by Pascal, who is asking: do you see a difference between solving a problem, for example finding the symbolic equation that describes it, and understanding the involved physics? If so, what's the role of machine learning there and what's the role of the human physicist in resolving this difference? That's a great, really great question, which gets even a little bit philosophical, because we have to first define what we mean by understanding. I think it's pretty clear that the systems we have do not understand these things any more than your pocket calculator understands anything about the real world. When you are able to predict something, I think the essence of what we mean when we say that we understand something is that we can relate it to a bunch of other concepts that we have in our head about the world and so on. And these systems we have obviously don't. But I think even at the rather infantile state of machine learning today, they can actually help us human physicists develop better understanding, because there have been many examples in the history of science where we actually discovered a formula by just spotting the regularity first, and only later came the understanding, right? I mentioned Kepler, it was just like that. He first figured out the ellipse, and only later Newton came along and gave us a deep understanding for why it's an ellipse, by deriving it from the inverse square law and stuff. And Max Planck similarly discovered this pattern in black-body radiation and just did symbolic regression and guessed that curve with the exponential in it, and then later got a deeper understanding.
So I think this can help us by looking through massive amounts of data we have and discovering other noteworthy patterns and regularities in there, which we can then take our ability to understand and try to explain better. Great, thanks. Next question by Azim, who is asking whether you've been able to apply the method to any new problems where we actually don't know the solution yet. That's a great question. So as I confessed at the end of the talk, so far the number of new equations hitherto unknown to physicists that we've discovered is a round number: zero. But we're very excited about this. We spent so much time getting this working that we haven't really had much time for it. But I would love to collaborate with any of you who has some data set. The key thing is, there are two regimes of physics, right? There's the kind of physics where theory is far ahead of data, like string theory. I hope I'm not offending anybody here. And then there is the kind of physics where data is far ahead of theory. Like QCD, you know, we can measure the properties of the periodic table much more accurately than we can currently compute them with lattice QCD. It's in the second category that I think we have the most hope: when we can measure something really accurately, either in the lab or in a supercomputer simulation, and get the answer in black-box fashion, but we think there might be some patterns there to be discovered that we haven't yet found. If you have any ideas of that kind, if you have any kind of data, I would love to collaborate with you to see if we can reach that first milestone together of changing the answer to your question from no to yes, you know, actually discovering something new. It's clear that it's not impossible, because if you had just taken Max Planck's data or Kepler's data or any of the data that gave those other equations we know, this would have found them, right? Okay, so there are so many, many questions coming. So let's keep going. There is one by Tim, please go ahead. Oh, thank you. Yeah, Max, slightly following on these questions, I just wanted you to say a little bit more about what you think might be the limitations of AI. And let me just give you two examples. One is philosophical and the other is more practical. The philosophical one was about the n-body problem, or the three-body problem indeed, where Poincaré famously discovered there was no formula for the motion in general of a body in a three-body problem. So you started at the beginning talking about giving your program data to find a formula, and I'm sure you could give it a chaotic trajectory and it would find something terribly complicated. But it would never say, look, I'm sorry, this is impossible, this is by definition impossible. And that's something, you know, human beings seem to have a capability of being able to do, which I think we have to say we don't yet understand how machines could do. Can I just, before you respond, can I give you the second example, which is from my own- Actually, let me respond to the first one first. Oh, okay. But then we'll have it fresh in memory, if you don't mind. So yeah, this is a very, very good question. So first of all, if you take the three-body problem, yeah, it's clearly not a particularly fruitful area of research to give a grad student the task of finding an analytic solution to it. It's just a complete waste of time. But there are other things that are useful to discover symbolically, like what Poincaré showed, right?
You can at least find out that the energy is conserved. Finding the conserved quantities can be done. If you discover a new integral of motion symbolically, that's very useful, and that's sometimes possible for systems, right? So all is not lost just because you can't solve the whole trajectory. There might be some other aspects of it you can capture, and that's great. The second thing I wanted to say is, yeah, humans are pretty good at knowing what's completely fruitless and what's hopeful. What would happen if you gave this to our AI Feynman? Well, what it would actually find is, it generally doesn't give you just one formula. It gives you this Pareto frontier. It gives you a ranked list of formulas from simpler to more complicated, right, that get progressively more accurate. And what you would see from staring at that plot is that you can get arbitrarily high accuracy by just making them arbitrarily complicated, by just overfitting the data like crazy. You know, if you take the points, you can always fit them with a million-degree polynomial or something that's useless. So you would see from the plot that this is a waste of time, because you're never able to get more out of the model than you put into it. So if you add up how many bits you needed to describe the model plus how many bits you needed to describe the corrections to the model, you would find that that's not actually smaller than just describing the raw data set, and it's worthless. The hallmark, and this is the answer to your question, I think, of how you can automate the criterion for what's good physics, is when you get more out of it than you put into it, when it takes many fewer bits to describe the formula and the errors than to describe the raw data set. So an example of that kind is, I found this old book once in the library with a hundred thousand spectral lines. And you can describe that super accurately with three numbers and the Schrödinger equation, right? And what's so cool is this information theory we're looking at gives you a way of quantifying why that's so cool, because the total description length of the thing went way down. And that's when you know you've learned something. Now, back to you. I could carry on, but I suspect there are so many other people wanting to ask questions. I'll shut up at this stage. Thank you. But I'll stick around a little bit after the official end of the colloquium also, if you want to chat, or you can email me. Okay. But thanks. Thanks, Tim. We have Sabine, please go ahead. Can you hear me? Yes. Yes. Okay, wonderful. Well, first of all, thanks for the very interesting talk. I don't myself have a data set, but I have for some while been trying to convince some astrophysicists that what all the astrophysical data for dark matter is actually trying to tell us is that dark matter comes in two different phases. As you probably full well know, there's this longstanding tension that modified gravity seems to work better below a certain acceleration scale and particle dark matter works better at long distances. And I'm personally convinced that if one would analyze the data in an unbiased way, you would find that it has to have some kind of phase transition. So I just want to pitch this idea to you. Okay, great. Just two quick comments. So one unbiased way of testing theories to see if they're any good, right, is by this information criterion again: you take your data set, figure out how many megabytes it is, and then you try to fit it with some formula.
And now you see how many megabytes you need to store the errors plus the formula. And if you can reduce that by a lot, you've learned something. And this is a more modern way of generalizing Occam's razor: the formula that reduces the total description length the most is the one that's the best. And there are some beautiful theorems by Grünwald and Rissanen and others showing that this totally eliminates overfitting. And since you don't need a human arbiter to come in here, Occam's razor is quantified, right? It can settle contentious things. I also just want to add, since you brought up astrophysics, there are many cases where we don't think maybe that there's an exact formula, but there are still very useful approximations, like the Navarro-Frenk-White formula for a cluster halo profile, right? That's totally something that AI Feynman could discover. It would be on the Pareto frontier there, it's quite accurate. And MOND, the MOND formula, it could also have auto-discovered before Milgrom did it. And to be a little bit humble, I think it's quite likely that every single formula that we teach, that I teach, for example, in my physics classes here, even the graduate classes, is also just an approximation to something more fundamental, in the effective field theory sense. So we shouldn't shy away from looking for formulas, even if we know they're not gonna be exact. They might still be very useful and give us good clues. Okay, great. Next question, Umut, please. Hi, can you hear me? Yes, loud and clear. Well, I mean, first of all, thanks a lot for the very interesting talk. So I have more machine learning questions. I'm not a physicist, sorry about that. So it's about the first part of your talk, about this AI Feynman. As far as I understand, you are first training a neural network and doing some analysis on top of the neural network, right? So my question is, usually, I mean, in many deep learning problems, depending on your algorithm parameters, like the step size or the batch size, even though you find a parameter with zero training loss, the quality of this parameter can be vastly different in terms of, like, test error or some other aspects. And then, for instance, the minimum description length could be different for this parameter itself. So I was wondering if you saw any changes in your ultimate analysis by changing the SGD parameters that you use? Or do you think there would be some correspondence between the, you know, description length of the network and the resulting formula that you find ultimately? Good. So yes, we played around quite a lot with the neural network architecture, and we did, first of all, the standard thing of stopping the training once the validation loss started to go up, to limit overfitting. We found that even relatively simple neural networks, just three-layer fully connected ones, actually worked pretty well. You want to use a differentiable activation function like tanh or softplus or sigmoid, not ReLU, since we're trying to fit continuous functions. The nice thing about this information framework is that it also sort of immunizes you against overfitting, because any overfitting you do is going to push you towards the random noise, right? And there's no reason why the noise should be well fit by a simple formula. So it worked out pretty well. You can sometimes get led down rabbit holes by noise.
That's why the Pareto thing made it a hundred or a thousand times more robust, because at each step we actually have a whole ensemble of models. So even if the correct one didn't look like it was maybe the most accurate one at that time, later on, when you put it all together, that often emerges as the winner. Thanks a lot. Great, thank you. So maybe let's go to the chat for a while, and I have two questions that maybe we can ask together because they're related. There's one by Syriam, who is asking sort of how loosely you can actually define physics and still find this method useful. You mentioned at the beginning that the systems have to have certain characteristics for your graph modularity trick to work out, but she's wondering how general this thing can actually be, and whether or not you can actually use these things to apply them to some machine learning systems themselves, like some sort of introspection. And that's related to the second question by Irina, who is asking a little more about the properties of the data. So what kind of properties should the data have in order to work with your framework? Does really any type of data work, or are there some prerequisites that need to be met? Good. So the first question: does this only work for physics or more broadly? So you all know, those of you who are physicists, that one of the most effective ways to piss off our non-physics colleagues is to tell them that it's all physics. Chemistry is just a sub-branch of physics, and so is biology and economics and everything else. It works every time to raise their blood pressure. So if you take that attitude, of course, it's all physics, but more seriously, we looked at a bunch of formulas from chemistry also, and we even did some stuff from economics and COVID spreading and so on. And it seems anecdotally that modularity crops up pretty much throughout the sciences, again and again in equations. Why is that? I'm not gonna give you a glib answer for it, because I don't have a universal answer for it, but this is related to the more basic question that in general, the laws, the formulas in physics, seem much, much simpler than generic ones, not just in being modular, but in many ways. Like if you take the standard model Hamiltonian of quantum field theory, setting aside gravity for a moment, why is it that it's a quartic polynomial, where the quartic part is the Higgs, right? A quartic polynomial is not a generic function. It's incredibly simple. And why is it local? Why do we have locality? Well, that's related to translational symmetry, and we can get into a discussion about why that is. There are all these ways in which it's simple. Why is it that in all the math you ever do in your work, pretty much all functions can always be written as compositions of functions with no more than two variables: two variables like plus and times, one variable like cosine and logarithm and exponential, and functions of zero variables like pi? Why don't you seem to need functions of five variables? If you have philosophical thoughts about any of these things, I'd love to hear them. But for now, if we just take them as a gift from our universe, it's very convenient, because it turns out it's a combination of all the things that I just said that enables AI Feynman to actually work so well. Then there was a question about data. What kind of data can you apply this to?
Coming to that data question: really, any data in practice, though if you just try to run AI Feynman on a function of 10,000 variables or a million variables, because the inputs represent pictures, it's probably going to fail; at least our code is not going to do well. So then it's really nice to first do this pregression step, where you train a neural network to try to extract the features that seem really useful, and then just run AI Feynman on those. But generally, what I find exciting about this is that it is very generally applicable. You also get to choose, in the final step of AI Feynman where you try to solve the modules, which functions you try to solve them in terms of. So we put in various things like cosine and log and so on, but you can customize this. If you're a string theorist, for example, and you suspect dilogarithms are going to come in, you can put them into your set. So you will want to use some of your physicist's intuition to guess what kinds of building blocks go in there. And if there are building blocks you're pretty sure are not going to appear, take them out; that will speed up the code a lot. And yeah, again, you mentioned data: if you have some data, I'd love to collaborate on it. Okay, great, thank you.

Will, your question please. Will, do you still have your question? Oh yeah, sorry, my internet is a little wobbly. So if you were thinking about applying this to try to find a theory of everything, could you see some way of overcoming the variation in scale between the different theories? I mean, general relativity versus quantum mechanics obviously operate on hugely different length scales and force scales. Do you see a way of either incorporating data that spans that, or possibly just using the combined models in order to find some further theory? Or would you see a way in which you'd need to let the program increase its own mathematical ability?

That's a really, really deep question. Let me say two things about it. First of all, I strongly suspect that if we ever get close to having an AI figure out the theory of everything, then we have also gotten very close to artificial general intelligence, where the AI can do not only that but also everything else that we humans do, and then we have a lot more serious things to worry about than just losing our jobs as physicists. In the nearer term, if we think about taking steps in this direction, I think the point you made about scale separation is very interesting. You could also phrase it in terms of dynamic-range issues, right? We human physicists are very good at seeing that if you have the Earth-Moon system, say, and then the Sun, there are two different time scales there. And if you have systems where the time scales are radically different, like the time scales of solar-system motion versus the 200-plus-million-year time scale on which we orbit around our galaxy, then we humans have this intuition that we should really not try to solve it all at once. We should do it in two separate steps: ignore the motion around the galaxy, the slow stuff, and just get the solar system done first; then we can look separately at how the center of mass of the solar system goes around the galaxy. I certainly haven't done much work on automatically discovering these scale separations. It's a fun thing to think about; a toy illustration of that timescale-splitting intuition is sketched below.
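As a toy illustration of that timescale-splitting intuition (an editorial sketch of the general idea, not something AI Feynman or the pregression pipeline currently does; all numbers here are invented), one can split a trajectory into a slow drift and a fast oscillation by averaging over a window much longer than the fast period, and then model each piece separately:

import numpy as np

# Synthetic 1D trajectory: a fast orbital oscillation superposed on a slow drift
# (think: a planet's motion around the Sun, plus the Sun's slow drift through the Galaxy).
t = np.linspace(0.0, 100.0, 10_000)
fast = 1.0 * np.cos(2 * np.pi * t / 1.0)   # period ~1 (fast scale)
slow = 0.01 * t                            # slow linear drift
x = slow + fast

def moving_average(signal, window):
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

# Average over several fast periods to isolate the slow component,
# then subtract it to isolate the fast component.
window = 500          # about 5 fast periods at this sampling rate
x_slow = moving_average(x, window)
x_fast = x - x_slow

# Each piece can now be modeled separately (a low-order polynomial for the slow
# drift, a sinusoid for the fast oscillation) instead of one joint model.
slow_fit = np.polyfit(t, x_slow, deg=1)
print("slow drift rate ~", slow_fit[0])   # should come out near the true 0.01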
If you just ran the pregression algorithm on it, what it would probably do depends on how you send in the data, I guess, because if you take the example I mentioned of the solar system and the galaxy and you use pretty large voxels, then it's not going to have any luck describing the small scales anyway; it will just focus on the large-scale behavior and do a good job on that. But yeah, we'll think about this more: how one could more generally have it deal with these multi-scale problems and break them apart the way we humans do into separate pieces, and then maybe come back with some perturbation theory for how the two things also interact. Okay. Thanks for bringing that up.

Do you think there's a way of defining the universe more topologically? As in, we see that energy and time are related via a Fourier transform, just as position and momentum are. Like we relate energy and time through the Planck constant, right? So do you see a way of just removing all the physical constants and finding the pure underlying topological formulation?

So I didn't quite understand the topology angle; let me say something about the part I did understand. We've noticed that some physics is easier in real space, as you said, and some is easier in Fourier space. If you have a problem like a particle collision, where it's actually the momenta that matter, so it's Fourier space that matters, then I'm hopeful that the pregression algorithm would discover that that's where you should look, because you're just asking it to take this big data set of the fields you've given it and extract out the small number of variables that matter. Whether it discovers Fourier-space quantities like momenta, or real-space quantities like positions, or something else, will depend entirely on what it finds to be useful. So there I think it can do it. But maybe you can elaborate a bit more on what you meant by topology here.

Well, what I really mean by that is: if we remove all of our universal constants and say there's some form of projection factor, then surely we can start to see the underlying mechanics of what force, for instance, really is. In general relativity we say it's motion along a geodesic, whereas in quantum field theory we describe it through the action of operators, creation or annihilation operators basically. So there's some range on this scale; can we find that using the same algorithms? For the algorithms I've told you about, the answer is certainly no, but it would be fun to keep in touch and talk about ways in which we could go in that direction. Okay, thank you very much. Thank you.

Great, thanks a lot. So maybe before we go on, let's just make sure that all the constraints are satisfied: how much more time do you have, Max? Because the questions keep coming, actually. Oh, we can go a little bit longer. I mean, you have to remember, for me the most fun part is this, because it's very boring for me to listen to my own talk, just hearing things I already know, but this I find fascinating, really engaging. Okay, wonderful. So then the next question is by Yang-Hui, which is more on the technical side. First of all he thanks you for your inspiring talk, and then he asks whether you have any idea of how your AI Feynman relates to other work in the field by Iten, Metger et al., titled Discovering Physical Concepts with Neural Networks. Have you heard of that, and how does it relate to what you've done?
Yeah, more generally, automated concept discovery is a very hot topic in machine learning, and I think it's fascinating to think about; that's another thing we physicists are pretty good at doing without quite knowing how we do it, right? It's also kind of related to the scale separation. The short answer to your question is no, I don't know how to do it now, but I'll just share a couple of reflections on the topic.

So for example, suppose you're an AI, or you're a physicist, and you just start with quantum field theory. Then you might notice: oh wow, in QCD there are certain bound states that keep recurring; it's always three quarks here, three quarks there, three quarks there, and it's either up, up, down or up, down, down, never anything else. At that point it becomes very convenient to introduce concepts and call these something, schmutrons and schmotons, or maybe neutrons and protons. You can call them whatever you want, but they emerge directly from the theory, right? There's a useful entity that keeps recurring. Similarly, if you're doing astrophysics and running cosmological simulations from scratch and you start discovering all these gravitationally confined fusion reactors all over the place, you might decide to upgrade that to a concept and call it a star, or something else. So I think the utility principle is a pretty natural way of defining concepts.

Physical things, the sort of concepts we call objects in physics, I think can be defined pretty clearly in terms of energy. We tend to call something an object if the interaction energy within it is much stronger than its interaction with the environment, right? If I have an ice cube here, I call it an object because the energy per particle I need to pull it apart, melt it or whatever, is much bigger than the mechanical interactions with the environment. If I look inside there at a water molecule, the H and the O have much stronger interactions with each other than with other molecules. If I look within the hydrogen atom, the electron has a much stronger binding energy to the proton than it has to the things around it. If I look inside the proton, the quarks are much more tightly bound to each other still, et cetera, et cetera. A toy sketch of this binding-energy criterion for objects is given below. But for other kinds of concepts it gets much more subtle again. And I think if you can actually get some deep insights from physics on how to discover more general concepts, like forces and the other things we talked about earlier, you're tying in with a much bigger quest in artificial intelligence, discovering concepts more broadly in the world, which is fascinating.

Great, thanks. There is a question by Yun Soo, which is again more on the technical machine learning side, and it relates to your pregression project, the one you showed as the second one in your talk. He's asking: you mentioned that you had to give the model a five-dimensional latent space for it to understand the two degrees of freedom of the rocket moving around the picture. Is there any relation, or do you think the same logic can be used to explain why very over-parameterized neural networks still work so well, even though the effective dimension of the solutions they find is actually very, very low?

Maybe, yeah, that's a fascinating thought. I've wondered a lot about this.
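As a toy sketch of the binding-energy criterion for objects described above (an editorial illustration; the gravitational pair energy, the threshold rule, and all names here are assumptions made for the example), one can group particles into "objects" whenever their mutual interaction energy is much stronger than their interaction with the rest of the system, crudely implemented as a fixed threshold plus connected components:

import numpy as np
from itertools import combinations

G = 1.0  # toy units

def pair_energy(m1, m2, r1, r2):
    # Gravitational interaction energy between two point masses (toy model).
    return -G * m1 * m2 / np.linalg.norm(r1 - r2)

def group_into_objects(masses, positions, threshold):
    # Link two particles if their mutual binding energy exceeds `threshold`,
    # then call each connected component an "object".
    n = len(masses)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in combinations(range(n), 2):
        if abs(pair_energy(masses[i], masses[j], positions[i], positions[j])) > threshold:
            union(i, j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Toy usage: two tight clusters of particles, far apart from each other.
rng = np.random.default_rng(1)
cluster_a = rng.normal([0, 0, 0], 0.1, size=(5, 3))
cluster_b = rng.normal([100, 0, 0], 0.1, size=(5, 3))
positions = np.vstack([cluster_a, cluster_b])
masses = np.ones(10)

print(group_into_objects(masses, positions, threshold=1.0))
# Expect two groups: particles 0-4 and particles 5-9.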
On the over-parameterization question: it seems like we physicists are very familiar with topological problems, like topological defects that can't get untangled, and if you embed things in a higher-dimensional space, they can often resolve themselves. And my intuition is, yes, there are many examples where machine learning fails to train not because you don't have any gradient information, but because there is a barrier in the loss function that you have to climb over before you can go down again, right? In physics we can sometimes quantum-tunnel through it; in machine learning, not so much. In physics we sometimes do simulated annealing to deal with these things. In machine learning, the thing corresponding to simulated annealing is to take large step sizes initially for your learning rate and then gradually shrink it; that's the equivalent of cooling off your physical system as you anneal. But in addition to that annealing strategy of large step sizes first, then smaller, if you have a neural network that is failing to train, I think it's a good strategy to over-parameterize it: put in some extra neurons, make the layers wider initially, or if it's an autoencoder you're training, try making the latent space higher-dimensional first. But then don't stop there. In the end, when you're done and have a low loss, see if you can get rid of those extra dimensions again, because they're not needed.

In machine learning, I think it's very helpful to separate two quite distinct challenges. One is the expressibility of your neural network: what class of functions is it actually able to express? The other is learnability. Often you need far fewer neurons or parameters to accurately express the function than you need to actually be able to learn it through gradient descent, because of these topological energy barriers you might have. So I think an interesting general strategy is to solve the topological problems by over-parameterizing, and then not just stop there, but see if you can strip away all that pork again with some regularizer, or with some of the techniques we used, maybe with some PCA or something, so you can get back to a much simpler neural network that performs as well as the higher-dimensional one. In fact, you often get even better precision once you throw away all those extra pieces, because all sorts of numbers that were approximately zero get set to exactly zero, and it gets closer to the right thing. A minimal sketch of this train-wide-then-prune recipe is shown below. So I hope this topology insight can be useful for a lot of things more broadly in machine learning.

Great, thanks so much. There is sort of a metaphysical question by Sam, who is wondering about the relationship between the machine learning engineer who devises these algorithms and the actual physics, so to speak, the models that the system is able to discover. So he asks: what if a physicist looks at planetary motion and interprets it in terms of epicycles? If that physicist then goes on and programs an AI, will the AI not also just describe planetary motion in terms of epicycles, so garbage in, garbage out? Do you think there is something that can be done about this problem?

I think absolutely, and here I think we can actually get some good insights from what I talked about today. If you think about this in terms of information theory, then first of all, let's look at epicycles, right?
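As a minimal sketch of the train-wide-then-prune recipe mentioned a moment ago (an editorial illustration, not code from the speaker's group; the width, learning-rate schedule, and pruning threshold are arbitrary choices for the example), one can deliberately over-parameterize a network, anneal the learning rate from large to small, and afterwards zero out the weights that ended up near zero:

import torch
import torch.nn as nn

# Toy data: y = sin(x), which a small network could express,
# but we deliberately train a much wider one first.
torch.manual_seed(0)
x = torch.linspace(-3, 3, 512).unsqueeze(1)
y = torch.sin(x)

# Deliberately over-parameterized: far wider than needed to express sin(x).
model = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))

# "Annealing": start with a large learning rate and gradually shrink it,
# loosely analogous to cooling a physical system.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.999)
loss_fn = nn.MSELoss()

for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    sched.step()

# After training, strip away the "pork": zero out weights that are nearly zero anyway.
with torch.no_grad():
    for p in model.parameters():
        p[p.abs() < 1e-2] = 0.0

pruned_loss = loss_fn(model(x), y).item()
kept = sum(int((p != 0).sum()) for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"loss after pruning: {pruned_loss:.4f}, nonzero parameters: {kept}/{total}")

Whether the pruned network really matches the wide one should of course be checked on held-out data; the point of the sketch is only to separate what the network can express from what gradient descent can actually find.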
Coming back to epicycles: at the time of Ptolemy, actually, we love to talk smack about him when we teach physics classes, saying it was all so stupid and wrong. But the fact of the matter is that at the time, the epicycle theory was more accurate than the heliocentric theory that said the planets go in circles. So if you think about the plot I showed you, this Pareto diagram, let me just share it with you again; give me one second, share screen. If you look, for example, at the plot in the lower-left corner: the epicycles actually had higher accuracy, less inaccuracy. The epicycle theory would be a red dot lower down than the heliocentric model of Aristarchus. But it had higher complexity, because it was more complicated. And when the data is very inaccurate, that difference doesn't buy you so much.

But then Kepler came along with the ellipse model, which was even more accurate than Ptolemy's. So now the Kepler dot is lower down than both Ptolemy and Aristarchus, right? And where was it on the complexity axis? Well, it was simpler than Ptolemy's model. So Kepler crushes Ptolemy on both accuracy and simplicity; Ptolemy's model is dead, and Ptolemy himself was too, by the time this happened. Now look at the comparison between Aristarchus's model, where planets go in circles, and Kepler's. The model with circles is simpler but less accurate, so those two points are actually both on the Pareto frontier; you would see them both surviving in the output of our machinery. So if we just had machine learning study the solar system, it would produce at least two dots on this frontier. One of them would be: everything just goes around in circles, really simple, somewhat accurate. And then you have the more accurate one, where things actually move in ellipses, but it's more complex. And if you did this with really accurate data, there would also be a third model, Einstein's, where you see that the orbit of Mercury is actually slowly precessing: still more complicated, still more precise.

So I actually think we shouldn't be too arrogant and say that all these people were losers because their formulas were somehow wrong. They were formulas worthy of staying on this frontier, just like mv squared over two is here, right? It's not a stupid formula. It's not as accurate as the relativistic expression, but it's the most accurate formula at that level of complexity, and we humans often find those things quite useful. Still today, after we know general relativity, we still use Kepler's laws a lot, right? And what I find so cool about this is that this information-theoretic approach will give you all the formulas that human physicists find useful: both the most accurate and complicated ones, like Einstein's, and also the various approximations that are remarkably accurate given how simple they are.

We have one last question, and it's by Matthew, who is again wondering about the data side and the possible behaviors you can get out of the data. He's asking: have you found a data set that doesn't feature any sudden change of behavior intrinsic to the data, but still gives you a function output that describes a bifurcation? And do you see any applications of such a thing to detecting potential faults in systems, artificial-intelligence ones but also important classical ones, before they even happen?
The one case where we did see bifurcations and chaotic behavior was when we applied this in the AI Poincaré work I talked about, on dynamical systems, where we were varying some parameter and the system went from being non-chaotic to being chaotic, and we saw how we could discover that automatically. So we were able to capture this in a phase diagram, right? If you think of the order parameter as simply the number of conserved quantities, you see that there's a sudden phase transition and these different phases show up. But I think this is a really fun direction and I would love to do more. If you have a data set you'd be interested in collaborating on, looking for bifurcations, please email me; it's tegmark@mit.edu, and that invitation goes out to all of you again. I think it would be really, really fun to collaborate on these things.

Wonderful. I don't think there's any better note to end on, so let's stop there. Let me thank you once again, both on behalf of myself and on behalf of the entire audience, and especially on behalf of those who asked these wonderful questions. Thanks a lot for being so generous with your time, for this wonderful talk, and for this wonderful discussion at the end. Well, thank you, and I want to thank the whole audience also. It's so cool; I feel so honored. Not only were there something like 140 people at the beginning, but it's even more remarkable that 40 minutes after the talk there are still over 40 people here. That's going to make me feel warm and fuzzy for the rest of the day, and it's only 12:30 here in Boston. So thank you so much. Thanks, and see you next week, everybody who's going to join next week's seminar. Enjoy your lunch break. Thanks a lot. Bye bye.