So, good afternoon, and thank you for the introduction and for the invitation. It's a very interesting workshop for me. Let me say that my talk is going to be slightly orthogonal to the previous one by Giuseppe, who talked about very interesting topics; this one is going to be more about numerics. I hope you won't be too scared by the numerics, but I'm going to do my best to present it in a more theoretical way as well, if you want. So, today I would like to discuss some recent applications we have been doing, and that other people have been doing, in this field. The idea is to use machine learning techniques, and in particular neural-network-based techniques, to study quantum systems. So, what I would like to do today is, first of all, to give you a short introduction to the basic ideas that we are using in machine learning. Basically, I will give you a crash course in machine learning, so that we can understand a little bit the techniques and the simple mathematical tools that we are actually using. And then I'm going to discuss how we can use these ideas in the context of quantum mechanics. So, even though maybe most of you don't know it, we can say that one of the involuntary fathers of machine learning is David Hilbert, right? David Hilbert is one of the people who did a great many things, a great mathematician, and at some point he had a collection of unsolved problems at the time — some of those are still unsolved. And one of those was, basically, the following. You have a polynomial, say a seventh-order polynomial of this form. Actually, the precise form is not very important, and you want to find the roots of this polynomial, right? We know that if the degree of the polynomial is larger than four, you cannot write down the roots of this polynomial in algebraic form, in radicals. But he was asking whether we can still find some compact form to write the roots of the polynomial as a function, if you want, of a and b, of the coefficients that you have in this thing. And this question was answered, actually in a much more general way, by Kolmogorov and Arnold, two other pretty clever guys, and they realized that this holds in general if you have a generic function of many variables. In particular, you can think of the roots of this polynomial as a function of a and b, so in this case just two variables, but in general you can have a function of several variables, say f of x1 up to xn. What they showed — and this result has been refined over the years; it has been refined into the form that I'm going to write by Sprecher in the 60s or 70s — is that this complicated, in general arbitrarily complicated but bounded, function of n variables can be written as a finite linear combination — notice that the range of this sum is set by the same n that you have here — of just two one-dimensional functions. The first one I'm going to call capital Phi, and this function takes as an argument another linear combination, again over the n variables, with coefficients lambda p, of another one-dimensional function, small phi, of my variables. So here q is the same integer that I have here, and eta is just a number.
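For reference, this is, as far as it can be reconstructed from the description above, the schematic form of the Kolmogorov–Arnold representation in Sprecher's refinement; the exact constants and index ranges vary slightly between versions of the theorem, so take it as a sketch rather than the precise statement used in the talk.

```latex
% Kolmogorov--Arnold representation (Sprecher's refinement), schematically:
% a bounded function of n variables written with two univariate functions,
% an outer Phi and an inner, continuous, monotonically increasing phi.
\[
  f(x_1,\dots,x_n)
  \;=\;
  \sum_{q=0}^{2n}
  \Phi\!\Big(\,
    \sum_{p=1}^{n} \lambda_p \,\phi\big(x_p + \eta\, q\big)
  \Big),
  \qquad \lambda_p \in (0,1],\;\; \eta \in \mathbb{R}.
\]
```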
And here the important thing is that what Kolmogorov and Arnold managed to show is that in order to write an arbitrary, but again bounded, function of n variables, it's enough to have two one-dimensional functions. So two univariate functions, if you want: capital Phi and small phi. These are one-dimensional functions, they are also continuous, and you can also take, for example, this phi to be monotonically increasing. So this is a rather surprising result, if you want, because it means that somehow all the complexity of the n-dimensional function is hidden in those two one-dimensional things. So this is an exact statement? Yes, it's Kolmogorov's theorem. Of course, this is a beautiful mathematical result, but in practice it's not necessarily useful, in the sense that those functions can have a very complicated, fractal structure, and all the complexity of computing this f is hidden in computing these one-dimensional functions. So in practice, from a complexity point of view, you are still dealing with the same complexity, but still this gives you an idea of the fact — that's what I want to... P goes from what? P goes from one to n, again; it's again the number of variables there. What do you know about the lambdas? The lambdas are between zero and one, just coefficients. But the important thing that I wanted to stress with this result, and actually why this is somehow related to machine learning, is that we immediately see from this that taking functions of functions — compositions of functions — is an extremely powerful construction, which allows us to represent an arbitrary n-dimensional function in a much more compact way than just taking linear combinations of basis functions. And this has been discovered independently also by nature, in a sense, since neural networks, which are, if you want, a very simplified way of describing the brain, work in a way which is similar to forming functions of functions of functions. Right, so this is where the connection between these two worlds, of pure mathematics and machine learning, comes in, because what is going to be important in the following is the notion of artificial neural network. Sorry, what about the thing you started with, this polynomial? Ah, right, yes — in the end, these guys managed to show that it's true that you cannot write the roots in a simple algebraic form, but you can write them as a composite function, so it's still relatively compact. But is there something special about this example? No, this was the original motivation, which Kolmogorov, and then Arnold later, addressed. Is it important that you had a seventh-order polynomial, or is it just an example? What's important is that it's larger than four, if you want. In this case, they were able to find the functions exactly. Yes, in this case I think they were able to find exactly those two functions, capital Phi and small phi, but the result is more general and can be extended to an arbitrary high-dimensional function — that was the main statement. Now, the point is: what I'm going to use in the following is an artificial neural network. An artificial neural network can be seen basically as a high-dimensional function. So I will deal again with a function f of x1, x2, up to xn, and I will represent those input variables basically as dots. You can think of those as, for example, real values.
And what this network does is that it takes this input and transforms it through a nonlinear circuit which is basically a function of a function of a function. So how does this work? Well, the idea is that each neuron — so each of these inputs is a neuron in this artificial neural network — takes its input, forms a linear combination of it, and passes it on, for example, to the next layer of neurons. So those are the input variables; they are just inputs for this network. And then each of those inputs is fed into the second layer of neurons. So what this means is that, for example, these variables here, which I can call y1, y2, y3, y4, et cetera, are basically functions. More formally, I can say that yj is equal to some nonlinear function of the linear combination of the variables coming into it, okay? So, for example, this will be a linear combination, with some weights, of the input variables, plus a bias term. But is it the same phi for all of them? Yes, in typical applications we take the same phi, okay? So you see that this somehow resembles the object that I was using before. And then the idea is that you can compose those variables at the next level. In particular, you can define another layer of neurons, which we can call z1, z2, up to zk, which is again a function of a linear combination, now of those yj, with some other weights, and with some other bias term for each zk, okay? So you see the idea: at the end, you take your initial input, this n-dimensional vector, and you transform it through a sequence of functions of functions, taking at each step linear combinations of those variables. Okay? Not necessarily — this is the simplest case where we take the same phi, but you don't need to. So the main point now is that, from nature, we know that those kinds of functions, those phi of x, are, first of all, typically continuous functions, because they come from the biology, if you want, of the brain and in particular of the neuron. And what we know is that those objects, which are called activation functions, are typically functions which — they are called activation functions because, as a function of x, they are essentially zero if the signal is smaller than some threshold value, which can be tuned by changing this bias term b, and then they activate — they spike, if you want, if you are dealing with a neuron — if the value of the input is larger than that threshold. But it's very important that this object is nonlinear. Otherwise, if you deal with a linear phi, you are somehow back to the case of just linear combinations of basis functions, which is not as powerful as taking functions of functions. But to go back to your motivating example, you mentioned that Kolmogorov and Arnold found that the phis could be fractal, or could be non-continuous. So, in the general case, this phi is continuous. But if f of x1 through xn is continuous or smooth, does that guarantee that the phi... no, okay. So, that's the tricky part. If f is continuous — I mean, if it's, I don't know, Lipschitz or something — it doesn't mean that you can find a phi which is Lipschitz.
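Stepping back to the feed-forward construction just described (the discussion of the theorem continues right after), here is a minimal NumPy sketch of this kind of layered composition; the layer sizes, the tanh activation, and the random weights are arbitrary choices for illustration, not anything from the talk.

```python
import numpy as np

def layer(x, W, b, phi=np.tanh):
    """One layer: a nonlinear activation phi applied to a linear combination."""
    return phi(W @ x + b)

rng = np.random.default_rng(0)
n, n_hidden, n_out = 4, 6, 3                      # arbitrary layer sizes

x = rng.normal(size=n)                            # input variables x_1 ... x_n
W1, b1 = rng.normal(size=(n_hidden, n)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), rng.normal(size=n_out)

y = layer(x, W1, b1)                              # y_j = phi(sum_i w_ji x_i + b_j)
z = layer(y, W2, b2)                              # z_k = phi(sum_j w_kj y_j + b_k)
print(z)                                          # output: a function of a function
```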
Actually, the point is that in practical applications we have to fix the form of this phi to something we can work with — for example, a hyperbolic tangent or something. And in that case, of course, this theorem breaks down, in the sense that this sum does not go up to the bound set by n, but to a larger number, which is not necessarily of the order of n and can in general be exponentially large in n, actually. But still, this is the motivating idea for why we want to use functions of functions. So, in this way, you avoid loops and feedback loops? Yes — you can also insert feedback loops if you want, and this actually increases the computational power of these networks. You can show that if you introduce loops, you have a much more powerful machine in terms of computational complexity. But in this case, there are none. But isn't the feedback included in the non-linearity? It depends what you mean by feedback. So, the non-linearity of phi, of the function, somehow takes care of the looping? By loop, I meant that the output can be fed back into the input. So, that's what I mean, and it's not included in this architecture, this specific one. But yeah, we can discuss it. So, this is the notion of artificial neural network that we will need in the following. Let's see now what we can do with those artificial neural networks and what kind of applications people have been doing, let's say, in real life. The easiest thing you can do is so-called supervised learning, which is basically the simplest form of machine learning. The basic idea is that you have a task, a generic task: you want to automate a task, you want to solve a problem, and the solution to this problem is given by some f bar, if you want, of some input variables. Let me give you an example. Imagine that my input, the vector x, is a string — say the string is just "oui", so in French, right? And you want that, given this string, f bar of x is equal to "yes", right? Like the English translation of this string. This is a very specific example, but this is, if you want, a basic version of Google Translate. Google Translate works more or less in this way: it takes a string in some language and translates it into another language — in this case from French to English, but it can be whatever. So the goal of the machine, in this case the translator, would be to find a good approximation of this unknown f bar, which is a very complicated high-dimensional function. So, how can we do that? Well, the basic idea, the paradigm of machine learning, is that we work with a large amount of data. We have, for example, a lot of pre-translated text, in this context. Formally, from a mathematical point of view, this means that I have a collection of labeled examples, x sub l, of pre-translated sentences, and I have a lot of them. So l is an integer which goes between one and Ns, where Ns is much larger than one. This means that I have a collection of pre-translated sentences, and for each of those I know what the translation is, right? So, if you want, I only have a partial knowledge of this function on those points.
And I want to infer the value of this function for some other, generic strings that have not been pre-translated before. So, this is how Google Translate works — at zeroth order, that's how it works. Now, how do we do that? Well, this is basically, as I was mentioning, an inference problem, and the idea is that we define a loss function. So, what I want to do is approximate this unknown high-dimensional function, f bar of x, with my artificial neural network — f sub ANN, if you want, for artificial neural network — which in general will depend on some parameters p. These parameters can be, for example, those weights that I have inside my functions, or whatever you want, also the architecture of the network itself. Now, to solve this problem numerically, as you can guess, I define a loss function, which depends on the parameters of the network, and which is basically just a sum over my labeled examples of the square loss — or another generic loss — of the output of the neural network at the current values of the parameters minus the pre-translated result, the y, the output that I was expecting for those samples that I have already classified, okay? So, now I have this function which depends on the parameters, and I basically just minimize it and find the optimal values for those parameters. This is the learning part of the algorithm. So, in machine learning there are two things: the machine, which in this case is the artificial neural network — this high-dimensional function which depends on a set of parameters p — and the learning part, which is basically an optimization problem. It's a high-dimensional optimization problem where we want to minimize some quantity, in this case this loss function, okay? One of the interesting aspects is that this optimization is done in a stochastic way. I will just briefly sketch this because it's somehow interesting. And, in particular... When you have something non-linear — I mean, pretty messy, I have to say — you might have many minima, right? Yeah, that's precisely where I want to go. Very good point. So, that's exactly the problem, but it's solved in a very elegant way, and I would say that people in the machine learning community were not aware of this from a physical perspective until very recently. But somehow, if you are a physicist, you can see why the optimization algorithm they have works so well and avoids local minima. And the idea is the following. What people use to optimize this high-dimensional function is just a noisy version of gradient descent, right? In general, gradient descent tells you that if you are at iteration k and you want to find the parameters at iteration k plus 1, you take the old set of parameters and you subtract — eta is a small parameter which they call the learning rate — you subtract eta times the gradient of the function that we want to minimize. Let me call it g of p; g is a vector itself. So, this is the standard gradient descent approach, and you know that this thing will converge to a local minimum of the function.
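In symbols — this is a reconstruction of the blackboard notation, with the square loss taken as the example loss — the loss function and the plain gradient-descent update read roughly:

```latex
% Supervised loss over the N_s labeled examples (square loss as an example)
\[
  L(p) \;=\; \sum_{l=1}^{N_s}
  \big\| f_{\mathrm{ANN}}(x_l;\,p) \;-\; y_l \big\|^{2},
\]
% Standard (full-batch) gradient descent with learning rate eta
\[
  p^{(k+1)} \;=\; p^{(k)} \;-\; \eta\, g\big(p^{(k)}\big),
  \qquad
  g(p) \;=\; \nabla_{p}\, L(p).
\]
```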
Now, it turns out that what people in the machine learning community do — now it's going to be problematic. Okay. Well, never mind. So, the g of p. What is done in the machine learning community is that, instead of taking the full gradient — the full gradient would be the sum over all the examples that you have in your library of pre-translated strings, so let me call the individual terms small g of x sub l and p — instead of taking the full sum, we approximate it with a partial sum over a batch, over a number of samples which is much smaller than the total number of samples: a sum over nB samples, where nB is much smaller than Ns, okay? Typically, it can be 100 or 50 of them. Now, this means, in practice, that instead of taking the full gradient, we are taking a stochastic approximation of the gradient. So we now have what's called a stochastic gradient, which is equal to the exact gradient g of p plus noise — plus, if you want, a normally distributed variable with variance sigma squared. Let's assume now that all the components of the gradient noise are uncorrelated, just to be on the safe side; then this variance basically scales like one over the number of samples taken in this small batch, okay? But it turns out that if you do this with a noisy gradient and you put it back into the standard gradient descent, you end up with a Langevin equation — a stochastic differential equation, the time-discretized version of the Langevin equation — where at step k plus one you take your old parameters p of k, minus eta times the true gradient g of p, minus eta times this normal variable. And this can be easily identified with the standard Langevin equation if we make the simple substitution that eta, this learning rate, is a time step, delta tau, and that this sigma — the variance of the noise in my gradient — satisfies sigma squared equal to twice the temperature over delta tau, the time step. So, this is the first-order Langevin equation. And the important point is that, in the asymptotic regime, what is sampled by this first-order Langevin equation — the probability of finding a certain set of parameters p — is proportional to the exponential of minus this loss function L of p over T. Doesn't it depend a lot on the barriers? Yes — no, but this statement is true: if the components of the gradient noise are all uncorrelated, it's always true that the probability distribution sampled by this stochastic differential equation is the Boltzmann distribution, where the effective energy is the loss function. So here the benefit of adding noise, if you want, is that instead of doing a simple standard gradient-descent optimization, we are basically exploring the full classical energy landscape, if you want, spanned by those parameters p, and in particular what we can do, and what people do to optimize those parameters, is change the effective temperature of the system. So for example, if we anneal the value of eta — if we slowly turn it down — it's completely equivalent to changing the temperature.
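Here is a minimal sketch of this mini-batch (stochastic) gradient descent on a made-up least-squares toy problem; the data, model, learning rate, and batch size are placeholders chosen only to show the structure of the update, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy labeled data set (a stand-in for the pre-translated examples):
# N_s noisy observations of a linear map with true parameters p_true.
N_s, n = 10_000, 5
X = rng.normal(size=(N_s, n))
p_true = rng.normal(size=n)
Y = X @ p_true + 0.1 * rng.normal(size=N_s)

def grad_loss(p, Xb, Yb):
    """Gradient of the square loss, averaged over a (mini-)batch."""
    return 2.0 * Xb.T @ (Xb @ p - Yb) / len(Yb)

p = np.zeros(n)
eta, n_B = 0.05, 50            # learning rate and batch size, with n_B << N_s

for k in range(2_000):
    batch = rng.choice(N_s, size=n_B, replace=False)   # random mini-batch
    p -= eta * grad_loss(p, X[batch], Y[batch])        # noisy (stochastic) gradient step
    # annealing eta here would play the role of lowering the effective temperature

print("recovered p_true:", np.allclose(p, p_true, atol=0.05))
```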
So if we change the temperature, we can then go toward the global minimum of this function, so it's much easier to optimize this neural network that we are dealing with. So we have a double benefit, because this is also intrinsically much faster to compute numerically, but we are also approaching the actual minimum of the function we want to optimize. Okay. So this was, let's say, the physical interpretation of stochastic gradient descent, which I find particularly nice. Yes. So the physical interpretation might be useful, especially for us, but when I look at that cost function, should I think that it has anything to do with a Lagrangian? No, the name L is for loss, not Lagrangian. Sure, sure. But what I'm saying is that there is no notion of locality there, probably. No, no — precisely, because these connections in the network can be highly non-local, typically, yes. Our physical intuition is probably, for most of us, built on local objects, on local Lagrangians or Hamiltonians. Should we... No, the only physical interpretation you can take from here is that you are dealing with a classical energy which is highly non-local, and the annealing procedure you are using is like simulated annealing for a Hamiltonian which is unphysical — yeah, I agree. But the fact that it's non-local may even be helpful, because it helps it thermalize faster, as opposed to... No, it helps describe more complex functions. For example, in the case of quantum systems... What I want to know is whether I can take my intuitions about thermalization in physics here — whether the fact that this cost function is not local is going to be an obstruction to that, or actually is going to make thermalization... I think that locality is not necessarily something that is related to how many local minima you have in this high-dimensional thing. At least, that's not how I see it. But in practice it's true that the best networks that people have studied and are using have this notion of locality somehow built in. Convolutional neural networks, for example, have local filters, and those are also easier to optimize. So in a sense, it might be that the two things go together, that somehow locality helps optimization, yeah. But I guess that's if your function has some natural locality, like for image processing maybe, but not for a... Yes, yes. Yeah, exactly. So if you have, let's say, a slightly entangled input — if we anticipate the following discussion — then it's clear that a function with some local structure is better. Are there other cases that behave like spin glasses, where there's trouble finding the minimum of the cost function? Yes, certainly yes. There are cases where it's very hard to optimize those functions; it takes days, years. And people have studied this from the perspective of spin-glass systems; there's a lot of literature on this, actually. I can give you the references. Okay, now... That is clear. Thank you, I was just fishing. Would it be possible to turn the lights off, please? Just to show you some slides. Thank you. Yeah. Yeah, okay, so what I wanted to say — okay. So, basically this idea that you have a function which allows you to solve a complex problem can be generalized to many other problems, not only to language translation. For example, optical character recognition.
In this case, the input is the image and the output would be the digitized version of the input, if you want. Or speech recognition: the input is a sound wave and the output is a string of text, and all other sorts of things you can think of. So, this approach is very general; it's really something that can be applied to a lot of applications. And that's why, since this approach is so general, people have started using it in many domains of science. For example, in particle physics, we can use this machine learning approach to help spot events in a sea of events happening in an accelerator, and try to identify whether or not they are associated with, for example, a Higgs boson; or to design molecules — all sorts of things. And people have started wondering whether, in the end, having a machine that is able to solve the problem for us would somehow transform the scientific method itself. Right? Because if we have a machine which decides for us whether a problem has been solved or not, or whether this molecule is going to be this or that, at some point we're probably also going to have a shift in the scientific method. But, of course, we are not there yet. So, I just wanted to discuss with you one of the first applications that has been done in the context of phases of matter, which is very recent. The idea here is that we can use this kind of supervised learning, where we basically minimize this loss function, to identify phases of matter. You can think that in this case my inputs are, for example, images of different objects, and you want the algorithm to identify which phase of matter those input images are in — for example, solid, liquid, et cetera. So, we have a set of pre-classified images, if you want, where x1 is already associated with a solid, x2 with a liquid, and so on. And so we can do this procedure of minimizing the loss function over this finite set of pre-classified images, and then we can give it a new image and ask: what do you think this is? Is it a solid or a liquid? So we can use the machine learning technique in this context to classify phases at a more refined level. That's what Juan Carrasquilla and Roger Melko did last year, published in this paper. And as a toy example — as a more concrete physical example than those images I was showing you before — they took the Ising model. So, as inputs they took configurations, samples drawn from the classical partition function of the Ising model, as these x1, x2, x3, x4, and those images are taken at different temperatures across the phase diagram of the Ising model. And for each of those images one can pre-classify them and say, for example, that those were obtained when the temperature was lower than the critical temperature — so they were in the ordered phase, so the labels y1, y2 are "ordered" — and then those other configurations instead were in the disordered phase. So we have, again, this large chunk of data that we've taken and pre-classified, and then we use the machine to see whether, for example, it is able to identify the phase transition in another model.
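Before their results, here is a tiny self-contained sketch of what such a supervised phase classifier can look like in the simplest possible form — emphatically not the actual Carrasquilla–Melko network: the Monte Carlo samples are replaced by a crude generator of ordered versus disordered spin configurations, and the classifier is a plain logistic regression on the bare spins, just to illustrate the train-then-predict workflow.

```python
import numpy as np

rng = np.random.default_rng(2)
L = 8                                   # linear lattice size, N = L*L spins
N = L * L

def sample_configs(n_samples, p_flip):
    """Crude stand-in for Monte Carlo samples: start from the all-up state and
    flip each spin independently with probability p_flip.  (Only the +1 ordered
    sector is sampled, so a linear classifier effectively learns the
    magnetization; handling both sectors needs a hidden layer.)"""
    flips = rng.random((n_samples, N)) < p_flip
    return np.where(flips, -1, 1)

# Labeled training set: "ordered" (few flips) vs "disordered" (random) configurations.
X = np.vstack([sample_configs(500, 0.05), sample_configs(500, 0.5)])
y = np.hstack([np.ones(500), np.zeros(500)])        # 1 = ordered, 0 = disordered

# Tiny logistic classifier trained on the bare spin configurations.
w, b, eta = np.zeros(N), 0.0, 0.1
for _ in range(500):
    prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # predicted P(ordered)
    w -= eta * X.T @ (prob - y) / len(y)            # cross-entropy gradient steps
    b -= eta * np.mean(prob - y)

# "New images": unseen configurations the machine has to classify.
X_test = np.vstack([sample_configs(100, 0.05), sample_configs(100, 0.5)])
y_test = np.hstack([np.ones(100), np.zeros(100)])
pred = (1.0 / (1.0 + np.exp(-(X_test @ w + b)))) > 0.5
print("test accuracy:", np.mean(pred == y_test))
```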
So in this case, they trained the model on a square lattice — so the machine was really able to identify the phase transition in the Ising model on the square lattice — and then they took the very same machine and applied it to the Ising model on the triangular lattice, where it was able to predict the phase transition, the critical temperature, with very good accuracy. So the output the machine was giving was compatible with the exactly known results and with other Monte Carlo simulations. The basic idea here is that the machine was able to learn the order parameter — the magnetization, if you want — of the system, and to use it to detect a phase transition in another system on which it was not previously trained. So that's the idea. Is the learning here on the visual part, or does it include, I don't know, the magnetization? No, it's just on the bare spin configurations. Just visual. So you treat those things as images, if you want, and you classify them with either zero or one, ordered or disordered. Isn't it simpler to just go to the magnetization? Sure, because you are a physicist and you know that the order parameter is the magnetization; but the idea is that we want the machine to find out by itself that the order parameter is the magnetization. So that's the idea. How big was their training database? I honestly don't remember in this case; we should have a look at the paper. It's in the thousands, I guess. How do you reformat the triangular lattice so that a configuration on the triangular lattice looks like... Ah, right, yeah. Yeah, that's a good point. Yeah, I don't know, actually, I should check. Yeah, that's true — in order for the machine to make a prediction, it has to take as input something in the same format it was trained on. So probably you take the sites, you pick... I don't know offhand. Well, yeah, but you can probably just do something simple — I think the triangular lattice is the same as a square lattice with an extra diagonal. Yeah, it takes the same pattern. For the uniform magnetization, it probably does not matter how you... Right, right. No, but I mean, what you were saying is: how do you map a square onto a triangle as the input image? Yeah, and whether that can affect the result. Like, you mean... Yeah, I think they were taking something like that. That's what they are doing, I guess, but I don't know, I'd have to check what they did. No, but then we have a problem here. No, like this, yeah, I think. Is that the diagonal one? Yes, yes. So I think that you can do this kind of thing, yeah. So this was one of the first applications of those ideas to classifying phases of matter. It has also been used for other things. But now what I would like to do is to show what we did in the context of quantum mechanics — to really find, if you want, the ground state of some Hamiltonian, possibly a correlated Hamiltonian — and to see how we can use machine learning to solve this problem. So... OK. So what's the goal, what's the thing that we would like to solve? Well, we have a many-body Hamiltonian — say the Hubbard model or the Heisenberg model — and one of the main tasks for which we would like the machine to give us some help is to find the ground state of this Hamiltonian. So we would like to find, if you want, the psi which solves the Schrödinger equation, the ground state of this object.
So in order to use a machine learning approach, the first thing we have to do is to somehow recast this Schrödinger equation as an optimization problem — something which is suitable for the learning approaches I was discussing before. But this is something which is relatively easy to do. We know indeed that the ground state of the Hamiltonian is nothing but the minimum, over all possible physical many-body states, of the energy functional E of psi, right? Where E of psi is nothing but the expectation value of the Hamiltonian over some normalizable physical state psi. Right? So, from the variational principle, we can immediately transform the Schrödinger equation into an optimization problem. So, if you want, the learning part of turning quantum mechanics into a learning problem comes for free, thanks to the variational principle. Now, the other part that we need is the machine, right? There are several ways we could use machine learning, or in general neural networks, to solve this problem. One possibility would be that, for example, we solve the Schrödinger equation for some Hamiltonians and then we try to infer the solution for some other Hamiltonians, in the spirit of what I described before. So we solve, say, a whole family of Heisenberg models and then we ask: ah, but what's the ground state of this other Hamiltonian? But of course, you can understand that the machine will get confused, because the two things are not necessarily well connected, also from a physical point of view. So what we do instead — this was our idea — is that we represent the many-body state itself as an artificial neural network. So what we do, first of all, is introduce some many-body basis for the problem. For example, let's assume that I have a spin one-half problem and that my many-body basis is just the collection of the n spin projections along the z-direction, sigma-z one up to sigma-z n, okay? And then what I do is represent the many-body state — those amplitudes psi of x — as an artificial neural network. In particular, those amplitudes are in general complex-valued, because I want to describe the most general wave function. So this is a function, if you want, of all my magnetizations along the z-direction for all the sites that I have in my system. So basically, I want to approximate the exact ground state associated with my Hamiltonian with some artificial neural network, right? And for a reason that will become apparent in a moment, what I will actually do is associate the log of the wave function with an artificial neural network. So I will say that my artificial neural network describes the log of the wave function, and this log of the wave function will again depend on some parameters p. These parameters p can be, for example, the connectivity matrix that I have in my network — whatever you have as parameters in your artificial neural network. And, from a physical perspective, those parameters now become variational parameters for my variational wave function, which is now in the form of an artificial neural network.
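Written out — again a reconstruction of the blackboard, not a verbatim formula from the talk — the variational reformulation and the neural-network parametrization are:

```latex
% Variational principle: the ground-state energy is the minimum of the energy functional
\[
  E[\psi] \;=\; \frac{\langle \psi | H | \psi \rangle}{\langle \psi | \psi \rangle}
  \;\ge\; E_0,
  \qquad
  E_0 \;=\; \min_{\psi} E[\psi].
\]
% Neural-network ansatz: the network gives the log-amplitudes in a chosen many-body
% basis, here sigma^z configurations, with variational parameters p (complex in general)
\[
  \log \psi(x;\,p) \;=\; f_{\mathrm{ANN}}(x;\,p),
  \qquad
  x \;=\; \big(\sigma^{z}_{1}, \dots, \sigma^{z}_{n}\big).
\]
```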
So what I can do now is find the ground state of this Hamiltonian by solving a learning problem — an optimization problem — which has a form very similar to the optimization problem we've seen before. It's more complicated, but the idea is more or less the same. And in particular, the main ingredient you need to know is how to compute the expectation value of the energy for a given set of parameters. OK, so if I call E of p the expectation value of the Hamiltonian for a given set of parameters, divided by the normalization, then this object can be written in a stochastic way, as a Monte Carlo average over configurations — over many-body spin configurations — which are sampled according to the modulus squared of psi of x, where x, again, is my many-body variable indicating all the magnetizations, times an estimator which is called, in the jargon, the local energy. In practice — just to show you what it is — the local energy is defined very easily as a sum over all the configurations x prime for which the matrix elements of the Hamiltonian between x and x prime are non-zero, of that matrix element times psi of x prime and p over psi of x and p. So this is the definition of the local energy. What this expression is telling you — and you can easily derive it — is that the quantum expectation value of the Hamiltonian can be written as a statistical expectation value, over this probability distribution psi squared, of some object which is not a classical energy; it's an effective classical energy that people call the local energy. So in a sense, you can recast the full quantum mechanical problem into a fully classical problem, where the classical energy corresponding to the quantum one is this local energy. And in particular — this is the heart of the variational Monte Carlo approach — at the end, you can also compute the gradients of the expectation value of the energy, and you can minimize it in the same way we were doing for the other quantities. So this is how this approach works. Basically, you train the machine? Yes, I train the machine. So this approach works as follows. Step one is that we fix the parameters in my network, the p's, and I generate a lot of samples x, drawn according to the modulus squared of the current wave function — for example, via Markov chain Monte Carlo. So I generate a lot of samples distributed according to psi squared. Then, step two, using those samples, I compute the gradient as a statistical expectation value and feed it back into the parameters. So basically I will say that p at k plus 1, the next step, will be equal to p of k minus some stochastic estimate of the gradient at step k, obtained using the samples that I have generated in step one. And then I go back to step one, because I've changed my parameters, I resample, and so on, until I converge to the minimum. So it's more complicated than what I showed you before, in the sense that this is a self-consistent procedure where I don't have pre-labeled examples of the solution I want to find; I have to generate them myself. So it's more similar to a machine which learns how to play a game.
For example, take the game of Go, which is the one Google famously cracked last year: basically, there you have some function that you want to optimize, which is the final score that you get in the game. But you don't have a winning trajectory, a winning strategy, beforehand — you have to find it yourself in a self-consistent way, for example, by playing against yourself. And it is the same idea here: we are playing against Schrödinger until we beat it and find the best energy. So let me show you just some numerical results. Before the numerical results — so you said this is variational Monte Carlo using an artificial neural network as a variational ansatz. Is that it? And so the question is whether this is a good ansatz — is that true? Yes. And do you have — so now we look at the numerics? Yes. So there are some in-principle reasons why this is a good ansatz, and there is some numerical evidence. So I think the answer is yes. It's a complementary ansatz, if you want, to MPS or tensor networks. No, but if somebody knows variational Monte Carlo, all they have to do is choose this as their variational wave function. And the selling point would be the numerics, or is there something more? No, the selling point is that you can systematically increase the precision of this object, if you want, by increasing the capacity of the network. So you can put in more and more neurons and systematically converge to the ground state. So in a sense, the size of the network is the dual of the bond dimension in the MPS language. The number of sites? Yeah, the number of neurons that you have in the hidden part of the network. I'm going to discuss this in a second. We have a quick follow-up question. For MPS, we actually know that they capture the properties of local Hamiltonians, and that's why they are maybe the right ansatz to represent ground states of local Hamiltonians. But this ansatz doesn't have locality — so is there still a reasoning why this would be the right or efficient ansatz for ground states of local Hamiltonians? I was expecting this question. No, I mean, the point is that the fact that this thing is non-local is also beneficial, because you can more easily describe highly entangled states. So here you don't have a limitation, in the sense that if you have a volume-law state, you can efficiently describe it with a polynomial number of parameters, because if you take long-range connections, it's trivial to satisfy a volume law, right? So in this sense, not having locality can be an advantage to describe, for example, chiral states and those kinds of things where you can have a strong volume law. But there is a way of recovering locality — I will get there in a second. But whether it's local or non-local depends on the architecture. Yes, and at the end it might even be that the architecture which is found by the network is effectively local. So we are giving it the freedom to be non-local, but at the end the network is free to choose to be local. Of course, if you constrain it to be local, it might be easier to optimize. So that's the thing we were discussing. So is it the right time to ask whether you will have any computational problem when trying to evaluate this network? Yeah, okay, so — numerics first? Yeah, so first I wanted to discuss, okay. So the — no, just a second.
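As a concrete illustration of the self-consistent loop described above — fix parameters, sample from |psi|^2, estimate the energy gradient, update, repeat — here is a small, runnable toy variational Monte Carlo sketch. To keep it short it uses a deliberately oversimplified one-parameter Jastrow trial state for a 1D transverse-field Ising chain, not the RBM ansatz described next; the model, ansatz, and all hyperparameters are illustrative choices rather than anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(3)
n, h = 10, 1.0                       # chain length and transverse field (toy values)

def log_psi(s, p):
    """log of the (real, positive) trial amplitude: a one-parameter Jastrow factor."""
    return p * np.sum(s * np.roll(s, -1))

def local_energy(s, p):
    """E_loc(x) = sum_{x'} <x|H|x'> psi(x';p)/psi(x;p) for H = -sum sz.sz - h sum sx."""
    e = -np.sum(s * np.roll(s, -1))              # diagonal sigma^z sigma^z part
    for i in range(n):                           # off-diagonal sigma^x part: flip spin i
        s_flip = s.copy()
        s_flip[i] = -s_flip[i]
        e += -h * np.exp(log_psi(s_flip, p) - log_psi(s, p))
    return e

def sample(p, n_sweeps=200, n_burn=50):
    """Metropolis sampling of spin configurations from |psi|^2 (single spin flips)."""
    s, samples = rng.choice([-1, 1], size=n), []
    for sweep in range(n_sweeps):
        for i in rng.integers(0, n, size=n):
            s_flip = s.copy()
            s_flip[i] = -s_flip[i]
            if rng.random() < np.exp(2.0 * (log_psi(s_flip, p) - log_psi(s, p))):
                s = s_flip                        # accepted with probability |psi'/psi|^2
        if sweep >= n_burn:
            samples.append(s.copy())
    return np.array(samples)

p, eta = 0.1, 0.02
for step in range(100):                           # the self-consistent learning loop
    S = sample(p)                                 # 1) sample from |psi(x;p)|^2
    e_loc = np.array([local_energy(s, p) for s in S])          # 2) local energies
    o = np.array([np.sum(s * np.roll(s, -1)) for s in S])      #    O = d log psi / d p
    grad = 2.0 * (np.mean(e_loc * o) - np.mean(e_loc) * np.mean(o))
    p -= eta * grad                               # 3) stochastic gradient step, repeat
    if step % 20 == 0:
        print(f"step {step:3d}   E/N = {np.mean(e_loc) / n:+.4f}   p = {p:+.3f}")
```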
So the actual ansatz that we use is of this restricted Boltzmann machine, RBM, form. So in this case, basically what we do is that we take this phi to be — you can derive it — the log of the hyperbolic cosine of x. So this is the activation function that we use, and this activation function comes from an effective Boltzmann distribution of an object which looks like that: you have your input nodes, sigma one, sigma two, up to sigma n — those are my physical degrees of freedom — which are connected through some weights, those W's, okay, to some hidden nodes, h1 up to hm, where m is a free parameter. So the more of those I put in, the bigger the network. And basically I say that the wave function of my system — psi of sigma in this case — is nothing but the Boltzmann weight associated with this object. So it's basically the sum over these hidden variables of the exponential of the sum over i and j of W i j sigma i h j. So basically this is a classical partition function where I have interactions between my physical variables and some hidden degrees of freedom, which are the neurons in the network, if you want, okay. So the more of those I put in, the larger the brain — this artificial network — is, and the smarter I can make it. And there are actual representability theorems that guarantee that if m is large enough, I can represent any n-dimensional function. Okay, so now I have this parameter m, and in particular I have what I call alpha, which is the ratio between m and the physical number of variables, m over n, that I can play with. So I can increase alpha and increase the accuracy of my calculation. So if you can — no, sorry. The complex numbers are coming from somewhere here? Yes, the W's are complex-valued. So of course the interpretation as a classical partition function breaks down at this point. And this is something which makes this very different from the traditional applications that people have done in the classical context. So what we need in addition, if you want, is to make this object complex. So, okay, here you can see, for example, the energy as a function of the iteration number when we optimize the energy for the one-dimensional Heisenberg model. So this is a model where you can do numerics with DMRG, or you can even find the exact ground state energy, basically with the Bethe ansatz. And you can see that when we increase alpha, we can systematically improve the accuracy. From this plot you cannot see it very well, but you see it here: this is a zoom of the final part of my optimization. You can see, for example, that going from alpha equals two to alpha equals four, the energy goes down. And you can also see from the scale that we are very close — we have a very high accuracy on the ground state energy. So you see, for example, that the scale is... So what is alpha? Alpha — sorry, I introduced it here very briefly — is the ratio of the number of hidden units over the number of visible units, so how many neurons you have after the input layer, if you want. So the larger it is, the better you find the ground state. So this is, I think, 80 sites — yes, 80. But you can see it also for other models. So this is the relative error on the ground state energy as a function of alpha.
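Here is a minimal sketch of how such an RBM amplitude can be evaluated once the hidden units are summed out analytically; this is the standard RBM form with visible biases a, hidden biases b, and complex weights W, and the parameter values below are random placeholders rather than optimized ones.

```python
import numpy as np

def log_psi_rbm(sigma, a, b, W):
    """log psi(sigma) for a restricted Boltzmann machine with the hidden units
    traced out analytically:
        log psi = sum_i a_i sigma_i + sum_j log(2 cosh(b_j + sum_i W_ij sigma_i))
    With complex a, b, W this gives a generic complex-valued log-amplitude."""
    theta = b + sigma @ W                      # effective field seen by each hidden unit
    return sigma @ a + np.sum(np.log(2.0 * np.cosh(theta)))

rng = np.random.default_rng(4)
n, alpha = 10, 2                               # visible spins and hidden-unit density
m = alpha * n

# Random complex parameters, standing in for the variational parameters p.
a = 0.1 * (rng.normal(size=n) + 1j * rng.normal(size=n))
b = 0.1 * (rng.normal(size=m) + 1j * rng.normal(size=m))
W = 0.1 * (rng.normal(size=(n, m)) + 1j * rng.normal(size=(n, m)))

sigma = rng.choice([-1, 1], size=n)            # one spin configuration sigma^z_1..sigma^z_n
print(log_psi_rbm(sigma, a, b, W))             # complex log-amplitude log psi(sigma)
```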
For example, for the one-dimensional models we typically find more or less a power-law behavior as a function of alpha — this is a log-log plot. And you see, for example, for the 1D transverse-field Ising model, for different values of the transverse field, that you can systematically converge, typically to very high accuracies. And you see that alpha equal to four is not necessarily a very large number, but you can already achieve typically very high accuracies. And a very interesting thing, which I believe is also worth exploring more, is that this is a rather compact representation of the wave function. For example, if you take the equivalent MPS which has the same accuracy — for example, for h equal to one, at the critical point, say 10 to the minus four — and you ask how many parameters there are in this MPS, and you compare with the number of parameters that we have in the RBM, you see that there can easily be a factor of 100, or sometimes even more. So, in a sense, the fact that we have a nonlinear decomposition of the wave function can be more expressive than a multilinear decomposition like the MPS. Of course, it also means that it's harder to optimize — on this, I think, we agree. Now, the thing is that, of course, in 2D it's more problematic, in the sense that converging is more delicate, as you can see here. We still managed to do a bit better than some PEPS results, but it's harder than in 1D. This is something that is not entirely understood at the moment — why the optimization is harder in 2D. Most likely it's not a limitation of the ansatz itself, but an optimization issue due to this high-dimensional space in which we are optimizing our parameters. Okay, so this — I don't want to go too much into the details — but basically what I wanted to show you is that in some cases you can see that the weights that come out are local. For example, for the two-dimensional Heisenberg model, you can see that those weights have a local structure, which is somehow reminiscent of the local antiferromagnetic correlations that you have in the system. Sorry, how do we read that plot? Yeah, I didn't want to go too much into the detail, but basically here we also use translation symmetry in the system, so this W now depends only on one index. So basically, you can think of this as a convolutional network, if you were wondering: you have a filter that you translate each time. But okay, this is a bit technical; I didn't want to cover this. But W14 is a weight — is it a number? It's a weight, yeah, so this is the weight here, basically. And what does it mean — is that a complex number? So — yeah, no, sorry — in the case of this model, you can take it purely real, because you know that if you do the rotation, you can make the ground state purely real. So what does it mean that a given coefficient has a representation in two dimensions? Ah, because this is a two-dimensional model. So what I'm doing here is that I'm fixing j, if you want — the hidden unit — and I'm varying i. So this i now is a two-dimensional index. I'm fixing the index of the hidden unit — I'm fixing, for example, this hidden unit — and I'm showing the value of this weight over all the other spins, which is a two-dimensional object. Should we think of it as just some sort of indication that the wave function has some degree of locality? Yes, yes, yes.
In this case, yes, but it's not always the case. For example, if you look at this one, for the 1D model, the weights are highly non-local. And we've seen clearly that it depends strongly on how you optimize the wave function: if you force it to be local, it stays local. So this is also something interesting. Is that with periodic boundary conditions? Those are all with periodic boundaries. Because then it still appears to be local, no? In this case — well, okay, it's true that this filter, for example, is relatively centered on this point, but less than, let's say, this one, isn't it? Anyway, what I wanted to say: if you want, you can also have a look at some codes that I wrote, which are online, to show how to apply this numerical scheme to the Ising model and other cases. And a more general code will hopefully be uploaded before next spring, where you can really play with it and also extend it if you want. Okay, now I think that my time is basically finished. I wanted to say that we can also do unitary dynamics. I don't have time to discuss this, but we can also solve the time-dependent Schrödinger equation using time-dependent variational Monte Carlo — a method that we developed some time ago. And then, yeah, okay, in our paper we also compared the unitary dynamics to exact results and found that, also in this case, it's more delicate, but you can find very good results for the dynamics as well. This is a quantum quench. You can also use it in two dimensions, and that's something which is coming out. Okay, and then I don't have time to discuss this, but we also have applications to tomography, which is the problem of reconstructing the state of a quantum system from given experimental measurements. Okay, so — I guess some of you are interested in the properties of those states; we can discuss them later, and I apologize, but I ran out of time. In particular, we can discuss some applications that people have done to spin liquids, where this ansatz has been shown to be superior to existing approaches. Okay, so let me just say one last thing. There is this strong representability theorem by Gao and Duan, which shows that you can represent any physical state — so basically the output of any finite-depth quantum circuit — with a neural network with only two layers. So if you add a second hidden layer here to the Boltzmann machine, you can show that you can write exactly, and with only a polynomial number of neurons, the output of this circuit, which is a pretty strong result. And in this case, you can show that the weights that you have in this network are purely local, so they correspond, if you want, to the gates that you apply at each step. And at this point, we also have a construction to generate those weights — this is forthcoming work — a procedure to generate, if you want, the weights in this network for some simple models like the Ising model. So we have, if you want, an alternative to the standard path-integral representation, which can be done in the space of artificial neural networks, of two-layer artificial neural networks. The important point is that only two layers are enough to describe all of quantum mechanics, basically, which is quite remarkable. Okay, so I finish here, and if you have more questions, I will be happy to answer them later. Thank you. So, if two layers are good enough, is there any advantage to going deeper?
Okay, so the tricky part is that the two layers come inside a Boltzmann factor — it's a Boltzmann machine with two hidden layers. So it's not a feed-forward function of a function like I wrote before, unfortunately. So the tricky thing — the thing that at the end matters — is that you cannot compute this amplitude exactly, because it would involve tracing out the second layer, which you cannot do analytically anymore. Only if you have one hidden layer can you do that analytically. If you have a second layer, you cannot do that analytically anymore, and you have to make some approximations, or find ways to compress this second layer back into a shallow one. So that's the important part. In the dynamical-systems literature, when you iterate a function, iterate a function, and so on, at the end some kind of universality emerges, right? So the precise form of the function doesn't matter, simply because you apply it many, many times. Is it the same also here, or does it matter a lot which function you choose there? No, in practice, numerically, we've seen that — okay, you can use this log of cosh, which is a function like that, but you can also use other functions; people typically use a function which is zero and then grows like that — and at the end they basically give the same results. So the important thing is that you have some nonlinearity; that's really the only thing that matters. The specific form of the function is not so important, but I have to say that if you want to use complex-valued parameters, things become trickier, because you have to make sure that you are dealing with an analytic activation function, at least for the way we optimize the parameters. So typically this choice is what we use. Just an addendum: is this, how to say, an experimental result, or is there some theory behind it which ensures that something is going on? No, there is no theorem about the specific activation function and why it should give you the result — apart from the generic theorem that tells you that if m is large enough and you have a generic activation function, which is just a bounded function... okay, so the theorem is the following: if you have a generic activation function, which is bounded between zero and one and monotonically increasing, then, if m is large enough, you can describe an arbitrary function. So this is the theorem. But in the quantum context, I don't know, this can probably be refined, but I don't know. Okay. It seems like in some 2D cases, your method gives better results than other existing wave functions that people construct with other techniques. Yeah, so it gives comparable or slightly better results — it depends on your taste, or on your... But what I want to say is that, for other methods — in this case, it's sensibly better — for other methods, I know that even computing expectation values of certain operators in the 2D case becomes problematic, because it's hard to contract the indices of those objects. For your method, for your variational wave functions, to compute this energy estimator, to how large volumes can you go? So let's say, in the PEPS case you were mentioning, it's problematic to contract the state to find the amplitude, right? In our case, what is NP-hard, or NP-complete, is optimizing the parameters. So this optimization that we have to do is the computationally hard part, which...
And we can easily — well, not easily, but if we are not careful — get stuck in some local minimum and not find the ground state. But for us, computing the amplitude of the wave function, of the variational state, is very easy — it's polynomial. But can you recycle information to go to larger lattices? Yes, we are experimenting with that as well: basically, you train the thing on a small system and then you increase the lattice little by little. And if you do this, you recover a form of locality in the weights as well — typically, not always. I mean, if you have a gapped system, you know.