Welcome back to the Hitchhiker's Guide to Condensed Matter and Statistical Physics, dedicated to machine learning in condensed matter. This is our third day, and today is dedicated to machine learning for many-body quantum physics. The main lecture will be given later on by Giuseppe Carleo from EPFL, but before that we start with the basic notions. And the basic notions will be given by Filippo Vicentini, who is a collaborator of Giuseppe Carleo and also from EPFL. As usual, you'll put your questions in the question-and-answer box, and we'll try to answer all of them during the lecture. So I give the word to Filippo. Enjoy today's lectures, everyone. Filippo, please. Thank you, Asia. Let's see. OK, so thanks, everyone, for the opportunity of being here and giving this introductory lecture on machine learning for many-body quantum physics. You know very well the format of today's lecture. I will now give this introduction, and I will mainly cover two topics: first, what neural quantum states are, and how we can use neural networks to encode and represent the information of a quantum state; and then how we can use this tool to solve some interesting problems in quantum mechanics, such as finding the ground state or performing the time evolution. Soon after this introduction, Giuseppe will give a seminar on exciting new developments in the field. So the talk is divided into two parts: the first part is about neural quantum states themselves, while the second is about the problems we are trying to solve with them. I will take some questions in the break between the two parts, and then at the end. I don't think I really need to motivate any of you about why we are interested in quantum physics and why we want to study quantum systems. You all know how exciting it is.
Think of the developments nowadays in quantum computing, or the experiments on high-temperature superconductivity. There are experiments and theories studying how chlorophyll and the mechanism that converts photonic energy into chemical energy work, using quantum physics. Behind all those problems and fields we have the framework of quantum physics, which we use to describe those systems. And maybe the most fundamental part of quantum physics is that we want to describe those states. I'm sure you know this very well already: when we try to describe a quantum system and its state, we have a problem, because the state is very complicated. Imagine we take a system composed of spins, particles that are either down or up, so each is a binary state, 0 or 1, and consider a system made of n such spins. If I just wanted to describe this as a classical system with a classical state, I would need to describe the state of every single particle in the system individually. So I need one bit of information for the first spin, since it can either be down or up; the second particle is also described by one bit, and the third, and so on. If I have n particles, I need n bits of information. So as I increase the size of my system, the memory requirement grows only linearly, and it is therefore relatively easy to store in the memory of my computer the state of millions or billions of particles. Instead, in quantum physics, if I want to describe this as a quantum system in a quantum state, I need to store a wave function, which, as you know very well, is basically a complex number associated with every possible configuration. So I need to store a complex number corresponding to the up-up-up configuration, to the up-up-down configuration, and so on and so forth.
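To make the scaling concrete, here is a small back-of-the-envelope calculation (my own illustration, not from the lecture): classically we need n bits, while the wave function needs 2^n complex amplitudes, say 16 bytes each in double precision.

```python
# Memory needed to store a state of n spins, classically vs quantum.
# Classical: one bit per spin. Quantum: one complex number (16 bytes,
# double precision) per basis configuration, of which there are 2**n.
def classical_bytes(n):
    return n / 8  # n bits

def quantum_bytes(n):
    return 2**n * 16  # 2^n complex128 amplitudes

for n in (10, 30, 50):
    print(n, classical_bytes(n), quantum_bytes(n) / 1e9, "GB")
```

At n = 30 this is already about 17 GB, roughly the limit of a laptop, and at n = 50 it is about 18 petabytes, which matches the orders of magnitude quoted in the lecture.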
This complex number is somehow a probability distribution together with some information about the phase. And the problem is that the number of possible configurations increases exponentially with the number of particles in my system, and therefore the memory required to store the wave function increases exponentially too. This is a problem, because nowadays on my laptop I can store the state of a bit more than 30 particles, I think, and beyond that I would soon need a cluster or a supercomputer. And there is no supercomputer on Earth that can store the state of more than about 50 spins, for example. I doubt that in the next years we will be able to solve this issue just by technological innovation, because the increase is exponential: every time we add another spin, another qubit, we need to double the amount of memory. So this is just a formal description of what is going on here. Essentially, the wave function can be written as this state vector, psi, and I need to store one complex number for every possible element in the basis; in this case I chose as a basis the up/down basis for every spin. When I do this, I can describe any possible wave function in the whole Hilbert space. But the Hilbert space contains all possible wave functions, and many of those are not interesting for what we are doing. Very often, we are just interested in the ground state of the system; sometimes we are interested in highly correlated phases of matter, because we can build technological devices with them, and so on and so forth. And most of the states that we are interested in are structured: they respect some symmetries, so actually there are fewer degrees of freedom within them, and they have some interesting correlations.
Therefore, I don't really care about describing all possible wave functions; I only care about describing the subset of them that is physically relevant. One of the first proposals brought forward, already at the beginning of quantum mechanics, was that, in general, instead of storing every possible entry of the wave function, I can write the wave function as some function that depends on a set of parameters, and those parameters should be fewer than the number of degrees of freedom of the Hilbert space. Essentially, what is going on is that I have some parameters, hopefully many fewer than the size of the Hilbert space; I feed them to an arbitrary function, and by doing this I can compute every entry of my wave function. Of course, this is interesting only if it solves the memory problem I talked about before. If the size of the space where those parameters live is polynomial in the size of my system instead of exponential, then I have addressed the storage issue, because now I only need a polynomial amount of memory to store my state. At the same time, if I can store the state that way, but to compute expectation values or any physically relevant quantity I still need to perform a sum over the whole Hilbert space, then I haven't addressed my problem, I have just hidden it, because I would still need to do an exponential number of operations on my computer. So actually, there are two classes of variational states, variational in the sense that they depend on those variational parameters W.
First, there are computationally tractable states, such as mean field, Gutzwiller mean field, or matrix product states, where I don't need to perform this whole sum over the Hilbert space, but I can recast it as an operation of polynomial complexity. For example, with matrix product states the sum can be rewritten as a product of matrices of polynomial size; with mean field, you don't sum over all possible entries but only over the local Hilbert space, and then take a product, and so on. However, in general, this function could be something for which we cannot do that. Imagine this psi is a neural network or an arbitrary nonlinear function: it is not easy to recast this sum into something that we know how to treat exactly in polynomial time. This second class may instead belong to the computationally efficient states. For a variational ansatz to be a computationally efficient state, it must meet two requirements. The first is that it must be efficiently evaluable, which means that I can compute the wave function, given the set of parameters and a certain element of my basis, in polynomial time. This is generally true if I have a polynomial number of parameters; the number of operations I need will then also typically be polynomial. At the same time, I also need to be able to sample configurations from the square modulus of the wave function efficiently. And this is not so trivial, because in the probability distribution induced by the ansatz, the denominator is the norm of the wave function. So unless my ansatz is already normalized, and I therefore know this quantity, I would need to compute a sum over an exponentially large space. I believe Giuseppe might mention later a class of ansatzes where this denominator is easy to compute, but in general it is not.
So we need more advanced techniques to do that. If those two requirements are met, there is an interesting theorem, actually very easy to prove, stating that it is possible to compute any expectation value of a k-local operator with polynomial accuracy in polynomial time. A k-local operator is a notion that comes from information theory, corresponding in more physical terms to an operator that has at most k-body interactions. Physical Hamiltonians and physical observables are generally k-local because we usually only consider one- or two-particle interactions; think of the Ising Hamiltonian, where we have two-body interactions. In some cases we have three- or four-body interactions, but there is usually a notion of locality in our models. The proof of this is extremely simple. It stems from the fact that if I start from the definition of an expectation value and insert the identity, I can expand over the whole Hilbert space. Then, if I divide both sides of this equation by psi applied to sigma, I can collect one term which corresponds to a probability distribution. It is very simple to see that this p(sigma) is a real number: it is positive because I took a modulus squared; it lies in the interval between 0 and 1; and it is normalized to 1 if I sum over the Hilbert space, by definition. So here I have a probability distribution, and I am multiplying it by this O_loc, the local term. And I claim that we can compute this term in polynomial time, even though the sum over eta here runs, in principle, over the whole Hilbert space, so over 2 to the n elements in theory.
In practice, if my operator is k-local, what happens is that I fix a row, corresponding to the entry sigma, in this big matrix, and then I look at all the non-zero columns in that row. If the operator is k-local, most of its entries are zero, so there are only polynomially many non-zero entries. And psi, we already said, is by hypothesis something we can compute efficiently, in polynomial time. So we can compute those entries in polynomial time. Now, of course, I still have this sum over the whole Hilbert space, but we can address that too. The reason is that, if you notice, what I have written here is a sum over a variable of the probability of that item times a quantity. This is exactly the definition of an expectation value, of a statistical average; some people might write it as E[O_loc]. So if I am able to extract elements of the Hilbert space that are distributed according to the probability p(sigma), I don't actually need to perform this whole sum: I can just average O_loc, this local estimator, over a smaller set of elements, hopefully polynomial in size. So the question is, can I sample this efficiently? Because once I can, I can compute the average as a sample mean, and the error will go down as the inverse of the square root of the number of samples I took. Of course, if I take infinitely many samples, my estimate would be exact, and I can control the accuracy with the number of samples. The question is, how can I sample the square modulus of a wave function efficiently? The problem, as I said before, is that the denominator cannot be computed easily, at least in general.
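As a concrete sketch of this local estimator (my own example, not from the lecture), here is E_loc(σ) = Σ_η ⟨σ|H|η⟩ ψ(η)/ψ(σ) for the 1D transverse-field Ising model, where the only non-zero matrix elements ⟨σ|H|η⟩ in a row are the diagonal one and the n single-spin-flip ones; `log_psi` stands for any efficiently evaluable ansatz.

```python
import numpy as np

def local_energy(sigma, log_psi, J=1.0, h=1.0):
    """E_loc(sigma) for H = -J sum_i s^z_i s^z_{i+1} - h sum_i s^x_i
    (1D, periodic boundaries). Of the 2^n columns of the row <sigma|H|.>,
    only n+1 are non-zero: the diagonal one and the n configurations
    obtained by flipping a single spin."""
    n = len(sigma)
    # diagonal (z-z) term
    e = -J * sum(sigma[i] * sigma[(i + 1) % n] for i in range(n))
    # off-diagonal (transverse-field) terms: one per single spin flip
    for i in range(n):
        eta = sigma.copy()
        eta[i] *= -1
        e += -h * np.exp(log_psi(eta) - log_psi(sigma))
    return e

# example with a trivial uniform state, psi(sigma) = const
sigma = np.array([1, 1, -1, 1])
print(local_energy(sigma, lambda s: 0.0))  # → -4.0 for this configuration
```

The cost is O(n) evaluations of log ψ, i.e. polynomial, exactly as the argument in the lecture requires.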
There is a technique, which I believe someone already mentioned in last week's lecture, called Metropolis-Hastings Monte Carlo. If we can compute not the probability of an entry, but a function proportional to it, in this case just the numerator of the probability distribution, then what I can do is generate a chain of elements, a succession sigma 0, sigma 1, sigma 2, and so on, that is asymptotically distributed according to the probability distribution I am trying to sample. The algorithm is very simple. You start from an initial configuration, which you can generate however you want; let's say you pick a configuration at random, say up, up, down. Then, at every iteration, you propose, according to some rule, a new configuration. Let's say my rule is: I pick one of those spins and I flip it. So if this is my sigma 0, my sigma prime will be up, down, down: I flipped the second spin. Now I compute the probability corresponding to sigma 0 and the probability corresponding to sigma prime. If the probability of sigma prime is greater than the probability of sigma 0, it means I moved in a direction of higher probability, so I accept the new move, and I repeat the algorithm starting from this new state. If instead the probability decreased, I don't reject the move outright, but I accept it only with some probability, given by the ratio of the two probabilities. If I repeat this operation many times, I obtain a succession of states, and, apart from the beginning, where there is some correlation with the initial state, whatever comes after will be distributed according to the probability I am trying to sample from. Of course, I need a sufficiently large sample size.
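The algorithm just described can be sketched in a few lines (a minimal illustration of mine, assuming real log-amplitudes; `log_psi` is any callable returning log ψ(σ), and the normalization cancels in the acceptance ratio):

```python
import numpy as np

def metropolis_sample(log_psi, n_spins, n_samples, n_burn=500, seed=0):
    """Sample configurations sigma with probability proportional to
    |psi(sigma)|^2, knowing only log psi. Proposal rule: flip one
    randomly chosen spin."""
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1, 1], size=n_spins)  # random initial state
    samples = []
    for step in range(n_burn + n_samples):
        i = rng.integers(n_spins)
        sigma_new = sigma.copy()
        sigma_new[i] *= -1  # propose: flip spin i
        # accept with probability min(1, |psi(sigma')|^2 / |psi(sigma)|^2)
        log_ratio = 2.0 * (log_psi(sigma_new) - log_psi(sigma))
        if np.log(rng.random()) < log_ratio:
            sigma = sigma_new
        if step >= n_burn:  # discard the correlated initial transient
            samples.append(sigma.copy())
    return np.array(samples)

# toy example: psi(sigma) = exp(sum_i sigma_i), favouring all spins up;
# the magnetisation per spin should approach tanh(2) ≈ 0.96
chain = metropolis_sample(lambda s: s.sum(), n_spins=4, n_samples=2000)
print(chain.mean())
```

Note that only differences of log ψ enter the acceptance test, which is exactly the point made above: the unknown normalization factors out of the ratio.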
Of course, my transition rule must respect some properties of my system, but in general this method works and is very powerful. I don't want to get into the details of why it works; let me just say that it is derived from detailed balance, or microscopic reversibility, which is essentially the idea that if I am at equilibrium and I have a certain distribution over the configurations of my system, then the probability of being in a state and going from that state to another must match the probability of the reverse process. The only thing you must keep in mind is that if you define the probability of going from one state to another, and you split it into the probability of proposing a move times the probability of accepting it, you can derive the acceptance formula proposed by Metropolis and later generalized by Hastings. What this algorithm gives us, essentially, is that even if I don't know the full probability distribution, it is enough to be able to compute the numerator, that is, ratios: since I am computing a ratio, the normalization factor, which is constant, cancels out, and I don't need to know it anymore. This is very powerful. So imagine, for example, that I decide to use a neural network as my variational function. I can pick a very simple two-layer feed-forward network, a restricted Boltzmann machine, where the input of this machine is a bit string such as up, up, down, which would correspond to one, one, zero, something like that. Then I multiply it by a matrix W, I add some bias, I pass it through a nonlinear activation function, usually log cosh, although you can actually use almost anything, and then I sum the outputs. This is a neural network.
The reason why it is a good idea to use neural networks for this, and there are by now several results in the literature showing that it is, is that neural networks are very good at capturing correlations in your system, at capturing hidden correlations within whatever input you feed them. Therefore they can efficiently compress the information: instead of needing to store the exponentially large state vector, the wave function, you can just store a few parameters, well, still a lot, but much fewer, this W and this bias in this case. And this function, as you can see, has a polynomial cost to evaluate: it is just a matrix-vector product, where the matrices have sizes of the order of the square of the number of spins in your system, and log cosh is applied elementwise, with some fixed cost per entry. So this satisfies the first condition that we asked for, and I showed you that we can also sample from it efficiently. This is what we call a neural quantum state. Using a neural network to describe a variational quantum state actually satisfies all the requirements we asked for, which means we can use it as a valid variational state. With that, before moving on to the second part, where I will talk about how we can use this technique to actually do something interesting and solve some problems, I will take a few questions. So let me check. Yeah, there were some questions in the question-and-answer box. Yeah, just a moment. So: could you again say what you mean by sampling?
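As an illustration of the ansatz just described (my own sketch; I use real weights for simplicity, whereas in practice the parameters are usually complex so that the phase of ψ can also be represented), the log-amplitude of a restricted-Boltzmann-machine state can be evaluated like this:

```python
import numpy as np

def rbm_log_psi(sigma, W, b, c):
    """log psi(sigma) for a restricted Boltzmann machine ansatz:
    log psi(sigma) = sum_i c_i sigma_i + sum_j log cosh((W sigma + b)_j).
    Evaluating it costs one matrix-vector product, O(n * n_hidden),
    i.e. polynomial in the system size."""
    theta = W @ sigma + b  # pre-activations of the hidden layer
    return c @ sigma + np.sum(np.log(np.cosh(theta)))

n, n_hidden = 4, 8
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((n_hidden, n))  # weights (real here)
b = 0.1 * rng.standard_normal(n_hidden)       # hidden biases
c = 0.1 * rng.standard_normal(n)              # visible biases
sigma = np.array([1, 1, -1, 1])
print(rbm_log_psi(sigma, W, b, c))
```

Plugging this `rbm_log_psi` into a Metropolis sampler is all that is needed to estimate expectation values over |ψ|².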
By sampling, I mean the following. What I have shown is that I am rewriting the operation of taking the expectation value of a quantum operator, which traditionally involves a sum over the whole Hilbert space, into a statistical average of a quantity O_loc, which depends on the entries. And I am averaging this quantity, which you can think of as a random quantity, over some distribution; and this distribution is the square modulus of the wave function. So sampling means that I want to take configurations, elements from my Hilbert space, configurations like up, up, up, down, according to their probability. Imagine that along one axis I list the elements of the Hilbert space, up-up, up-down, down-up, down-down, and on the other axis I have the probability, and imagine my probability distribution is some curve over these configurations. I want to be able to extract a set of configurations that are distributed approximately according to this distribution. Sampling means extracting this set. So this is classical Monte Carlo; do you have any comments or ideas on quantum Monte Carlo? Well, the idea here is that this is classical sampling: I am classically sampling from a distribution. In this Metropolis algorithm, do we assume no interaction? I am not sure what you are asking, because here I am just trying to sample from a distribution. This is simply a technique to sample a probability distribution; I am not assuming an underlying model, I am not assuming anything. Maybe we can unmute Lavi Kumar so you can ask the question directly. Yeah, hi Filippo, thanks a lot for this talk. So actually my question was whether these underlying spins interact.
So that was essentially the question: when you were sampling them with the probability, which you explained very nicely, do these underlying spin chains that you flip have some kind of local interaction among them, or do we just have a distribution over these spin configurations? What I am doing now does not refer to any model in particular. I am just explaining how you would go about computing expectation values if you have a set of parameters for your variational state; I am not making any assumption about what the model is. And the spin chains on which I am doing the sampling are configurations, basis elements of my Hilbert space. I am sampling basis elements; this is completely unrelated to the physical system I am studying. Okay, thank you. Another question: what is the difference between a neural network and a restricted Boltzmann machine? This is very simple: a restricted Boltzmann machine is just one particular type of neural network. I think it was already discussed at length in the first lectures you had in this course, two weeks ago, which is why I didn't really talk about it. Essentially, it is a very simple neural network where this is the input layer, this is the output layer, and there is only one intermediate layer in between. Restricted Boltzmann machine is just a name for this particular kind of network; in general, you can add many more layers, and you can add some very particular interactions to it. Next question: by avoiding the exponential cost, are we losing any kind of information? Indeed, we are doing two things. The first is that we are parametrizing the Hilbert space with some function, and of course this function is not able to represent, in principle, every possible wave function; it will only represent a subset of them.
So yes, of course, I am cutting away parts of the Hilbert space that I in principle do not care about. A way to see this is to consider, for example, the mean-field ansatz: it cuts away any state that has quantum correlations between different sites. Matrix product states bound those correlations. Neural networks remove parts of the Hilbert space in a much more contrived manner that we don't totally understand, but they are doing exactly the same thing. That is for the ansatz. Then, for the sampling, I am losing some information about the expectation value: I no longer know its exact value, I only know an estimate with a certain error. This is where I traded away the exponential complexity. Next question: in neural quantum states, is the algorithm able to choose the physically relevant part of the Hilbert space by some means, like probability, or do we already provide it with a wave function with variational parameters when it calculates the expectation? What I was talking about so far was just neural quantum states in general: an efficient way, like matrix product states or mean field, to parametrize the Hilbert space. In the next part of the lecture I will talk about how we can determine the parameters that give us the state we are interested in, so how we can determine the parameters for the ground state and so on. Basically, we have an optimization problem. Next question: does the cost of calculating the wave function drop if we consider only symmetric and antisymmetric wave functions, in the case of identical-particle systems? Yes, indeed, if you insert some information about the structure of your problem into your neural network. For example, imagine you want to describe a standard condensed-matter system on a lattice, and this system has some point symmetry or translational symmetry.
Then you can, for example, constrain this W matrix so that the output is invariant under translations; translational invariance reduces the size of the W matrix by a factor of N in 1D, because you have N possible translations, so it does reduce the cost. In general, it is a very good idea to use this information to further constrain your neural network or your variational ansatz. Next question: what is meant by polynomial accuracy? I mean that the accuracy depends polynomially on the number of samples I take: if I have N samples, the error goes down as one over the square root of N, not exponentially. So, for example, if you are near a phase transition and you want to determine an observable with exponential accuracy, because this would allow you to tell what phase you are in, you would need an exponentially large number of samples. Next question: is it true that the neural network is taking advantage of the fact that samples in real-world data sets actually live in a small subset of the Hilbert space, which we humans cannot recognize? I am not sure I understand the question. The whole Hilbert space is a space of wave functions, and the wave function is a vector in this space. Could you unmute Tim, so he can ask the question directly? Hello, can you hear me? Hi, yes. So what I mean is that those samples, those data in real-world data sets, for example, data about water molecules, only live in a small subset of the Hilbert space; their wave function is in a small subset. But we don't know the correlations between those molecules, so we don't know what the actual subset is, and a neural network can help us capture that information.
So it is implicitly finding the subset, is that true? Well, with neural quantum states it is hard to say exactly what part of the Hilbert space we are parametrizing. There are studies about it, but it is still hard to say exactly. In general, though, what we are seeing is that this parametrization is able to describe physically relevant states. Okay, so that makes it more efficient to compute, right? Yes. Okay, thank you. We have 25 more minutes, so I'll just go on with the second part of the presentation, and I will finish answering the questions afterwards if needed. Okay, so in this second part I want to talk about some problems that we can solve with neural quantum states. In general, I can think of two very broad classes of problems. One is when I want to determine, for example, the ground state, so the weights corresponding to the ground state of a Hamiltonian: you give me a Hamiltonian, and I need to find its ground state, or I need to determine the state as evolved by this Hamiltonian, something like that. The second category of problems is reconstructing a quantum state. Imagine you have an experiment and you don't know exactly the state of your system; you want to determine it. What an experimentalist can do is perform several measurements, local or otherwise; eventually he has a set of measurements, he knows the bases in which he performed them, and he wants to train a neural network so that it describes the state of his system, which a priori he doesn't know. I believe Juan Carrasquilla will talk about this second application next week.
Today I will focus more on the first category of problems. So again, the first thing I want to talk about is determining the ground state. You all know that the ground state is the eigenstate of the Hamiltonian with the lowest energy. Essentially, what I want to do is: given a neural quantum state, a state described by a certain neural network with a fixed architecture, let's say a restricted Boltzmann machine, I want to find the set of weights, W_GS, that best approximates the ground state. To do this, I want to recast the problem of finding the ground state as an optimization problem. This was done a long time ago with the formulation of the variational principle. Essentially, you notice that the energy is an observable, so it is real, and we know that the state with the lowest energy among all possible states in the Hilbert space is the ground state. So, given any possible set of parameters W, the energy will be greater than or equal to the ground-state energy; and if I find the set of parameters that gives me the ground-state energy, then I know I have found the ground state. In general, the lower the energy a set of parameters gives me, the better the approximation of the ground state. So what we want is to find the set of weights that gives the smallest energy. This is really an optimization problem, and it could in principle be addressed with global optimization techniques, where you evaluate the energy for every possible set of parameters and then pick the one with the lowest energy. But since the space where the parameters live is very high dimensional, not exponentially large, but still hundreds or thousands of dimensions, I cannot do this in general.
So instead we use iterative optimization techniques. Essentially, you start from an initial set of parameters, W0, chosen at random or as an educated guess. For those parameters you can compute the energy; we know this is efficient. Then we compute the gradient of the energy with respect to these parameters, and we use this gradient to correct the parameters. At every iteration, we take our parameters and subtract the gradient multiplied by some speed, a learning rate, as we call it, or optimization rate if you want, and in that way we generate the new set of parameters, and we repeat this over and over. Now, it is interesting to notice that you can think of this equation as delta W divided by some sort of discrete time delta t, equal to minus eta times the gradient of E with respect to W. So essentially we are solving some sort of discrete differential equation for the parameters W, and we are going down the potential well induced by this energy. Of course, the first question you might have is: can we compute this gradient efficiently? So far I have only told you that we can compute the energy, not its gradient. This is quite easy to show. Going back to what I showed you before, the energy is an observable, so we can rewrite it as a sum over the whole Hilbert space of the probability of an entry times a local estimator, this E_loc, which can be computed efficiently because the Hamiltonian has at most one-, two-, or in any case few-body interactions. So I can rewrite the expectation value of the energy as the statistical average of this E_loc. The gradient of the energy is then the vector whose entries are the derivatives with respect to each parameter.
So if you want, this vector is just (dE/dW_1, dE/dW_2, ..., dE/dW_N); that is only a different way of saying it. And those are just derivatives. You have already seen the backpropagation rule, so you know there is a way to efficiently compute the gradient of a neural network with respect to its inputs or its parameters, at roughly the same computational cost as evaluating the network itself. So every entry of this vector can be computed efficiently. In particular, you can show, the algebra is not complicated but would just take some time, that the gradient can also be written as a statistical average: an average of E_loc times O_k, where O_k is the log-derivative of our ansatz, the neural network, O_k(sigma) = d log psi_W(sigma) / dW_k. I told you that we can compute the derivative of a neural network efficiently; here I am just taking the product and averaging. Again, those are statistical averages, so I am not doing the whole sum over the Hilbert space. In theory this is the sum over sigma of P(sigma) E_loc(sigma) O_k(sigma), but instead of the full sum I sample sigma and take the statistical average over a small, polynomially large subset of basis elements. This means, however, that the gradient I have estimated is not exact. It is a noisy estimate. So in my equation of motion, the equation I use to update my parameters, I have to keep track of a random term. Let's assume that the error is distributed as a Gaussian, a normal distribution.
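The covariance form of the gradient can be verified by exact summation on a toy example. The ansatz below, psi_W(sigma) = exp(sum_i W_i sigma_i), is an illustrative choice (so the log-derivative is simply O_k(sigma) = sigma_k); the estimator dE/dW_k = 2[<E_loc O_k> - <E_loc><O_k>] is compared against a finite-difference derivative.

```python
import numpy as np

n, dim = 3, 8
sigmas = np.array([[1 - 2 * int(b) for b in f"{k:03b}"] for k in range(dim)])

# Toy Hamiltonian: H = -sum_i Z_i Z_{i+1} - sum_i X_i, built densely.
Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
I = np.eye(2)
def kron3(a, b, c):
    return np.kron(np.kron(a, b), c)
H = (-kron3(Z, Z, I) - kron3(I, Z, Z)
     - kron3(X, I, I) - kron3(I, X, I) - kron3(I, I, X))

def psi(w):
    """Mean-field ansatz psi(sigma) = exp(sum_i w_i s_i); so O_k(sigma) = s_k."""
    return np.exp(sigmas @ w)

def energy(w):
    p = psi(w)
    return p @ H @ p / (p @ p)

w = np.array([0.3, -0.2, 0.1])
p = psi(w)
prob = p ** 2 / (p ** 2).sum()           # Born probabilities P(sigma)
e_loc = (H @ p) / p                      # local energies E_loc(sigma)
O = sigmas.astype(float)                 # log-derivatives, one column per W_k

# Covariance estimator of the gradient (exact sums here; MC averages in practice)
grad = 2 * (prob @ (e_loc[:, None] * O) - (prob @ e_loc) * (prob @ O))

# Finite-difference check of the same gradient
eps = 1e-6
fd = np.array([(energy(w + eps * np.eye(3)[k]) - energy(w - eps * np.eye(3)[k]))
               / (2 * eps) for k in range(3)])
print(np.allclose(grad, fd, atol=1e-5))  # -> True
```

In practice the exact sums over sigma are replaced by averages over samples drawn from P(sigma), which is what makes the estimate noisy.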
Then I know that the variance of the error goes down as one over the number of samples I have taken. So if I take infinitely many samples, the error goes down to zero: there is no noise, and the equation is exact. But if the number of samples is finite, this is not true. And this is actually very interesting, because the update now really starts to look like a Langevin process, the equation governing the motion of a particle in a potential well, where for us the potential corresponds to our energy functional. For a Langevin process, the noise term depends on the temperature of the medium the particle is in. So while in physical terms this would be a temperature, for us the temperature is essentially set by the number of samples; they are inversely proportional. If you think about it, with an infinite number of samples my temperature is zero, and the optimization dynamics is exact. Instead, with a finite number of samples, I am at a finite temperature. And this is very interesting. Imagine the potential we are trying to optimize looks like a rough landscape, with energy on one axis and some parameter on the other, and imagine my optimization falls into a local minimum. If my gradient is exact, then at a local minimum the gradient is zero and I cannot get out of it. But since I take a finite number of samples, I have a finite temperature, so there is a certain probability that I will get out of this well, continue my optimization, and eventually fall into, hopefully, the global minimum. This is what happens when you do not do plain gradient descent but stochastic gradient descent, an approximate gradient descent.
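The scaling of the noise with the number of samples can be checked in a few lines. The numbers below are arbitrary stand-ins for the mean and spread of E_loc; the point is only that the fluctuation of the sample mean shrinks like 1/sqrt(N).

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean, spread = -1.5, 2.0            # stand-in for <E_loc> and its std dev

def std_of_mean(n_samples, n_repeats=2000):
    """Empirical fluctuation of an N-sample mean, over many repetitions."""
    draws = rng.normal(true_mean, spread, size=(n_repeats, n_samples))
    return draws.mean(axis=1).std()

s100 = std_of_mean(100)
s10000 = std_of_mean(10_000)
print(s100 / s10000)                     # close to sqrt(10000/100) = 10
```

Increasing the sample count by 100x reduces the noise, and hence the effective "temperature" of the stochastic dynamics, accordingly.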
And this is one case where doing things approximately actually helps us, because it helps us not get stuck inside local minima. Then there are other problems: usually there are big regions where the gradient is almost zero, so it is very hard to optimize there. But local minima, at least, are less of an issue, especially if at the beginning of your optimization you keep the number of samples not too high, exactly to avoid falling into local minima, and then increase it later. So with that, I hope I have convinced you that we can find the ground state by recasting it as an optimization problem. Another interesting problem is time evolution. If you can perform time evolution, that is, given a state and a Hamiltonian you can compute the state at all later times, you can also think of performing a sort of imaginary-time evolution, where instead of evolving in real time you evolve in imaginary time, which will allow you to converge exponentially fast towards the ground state. So here is a sketch of how we do it. Imagine you have a state psi_W for some parameters W, let's say an interesting state, and you want to evolve it. We know how to write the evolution analytically: it is e^(-iH delta t) psi_W. And we want to find a state psi_(W + delta W), where delta W is some update of my parameters, that approximates this unitary time evolution. The way we do this is with linear approximations. If we assume that the time step is small enough, we can linearize the exponential of the unitary operator. At the same time, we can expand the right-hand side, psi_(W + delta W), to first order in a Taylor series around W.
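As a quick illustration of why imaginary-time evolution converges to the ground state, here is a small sketch with a made-up 6-level Hamiltonian of known spectrum: applying exp(-tau H) to a generic state and renormalizing leaves essentially only the ground-state component, since every excited component is damped by exp(-tau (E_n - E_0)).

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))   # random orthogonal eigenbasis
spectrum = np.arange(6.0)                      # eigenvalues {0, 1, ..., 5}
H = Q @ np.diag(spectrum) @ Q.T                # made-up gapped "Hamiltonian"
gs = Q[:, 0]                                   # exact ground state

psi = rng.normal(size=6)                       # generic initial state
tau = 10.0
# exp(-tau H) applied via the eigendecomposition of H
psi = Q @ (np.exp(-tau * spectrum) * (Q.T @ psi))
psi /= np.linalg.norm(psi)

print(abs(gs @ psi))                           # ~ 1: projected onto the ground state
```

The variational version of this projection is exactly what the linearized update derived next implements, one small step delta tau at a time.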
So we can see the linear effect of changing the parameters. What we get is, of course, psi_W again, because we are expanding around that point, plus the log-derivatives: for each configuration sigma, psi_(W + delta W)(sigma) is approximately psi_W(sigma) times (1 + sum over k of delta W_k O_k(sigma)), where O_k(sigma) = d log psi_W(sigma) / dW_k. Now what I want to do is match those two expansions. The way we match them, the way we find the set of delta W that solves this approximate requirement, is that we define an overlap, the distance between the two states, which is also the Fubini-Study metric, and we try to minimize this distance. I will not do the full calculation because it takes a bit, but you can actually find the solution: the updates delta W that solve it are given by a linear equation with an S matrix, S_(k,k'), on the left-hand side, and on the right-hand side the same gradient of the energy that I showed you before. This S_(k,k') is known as the quantum geometric tensor. It has the structure of an expectation value, a statistical average, of the log-derivatives. In the context of variational Monte Carlo it was first proposed by Sandro Sorella for imaginary-time evolution, where he showed that this object generates the imaginary-time evolution; you just get rid of the i. What is also interesting is that this quantum geometric tensor carries information about the metric of our ansatz. You know very well the space where our W parameters live; imagine I have a point W and I take another point W' = W + delta W, very close to it.
The distance between those two points in parameter space, which has a Euclidean metric, is just the norm of delta W. However, what we are really interested in is the Hilbert space: my neural network takes a configuration and gives me psi_W. And since the mapping is highly non-linear, neural networks are highly non-linear functions, it is entirely possible that a state which is very close to the initial state in the space of the parameters is actually very far away in the Hilbert space, the space of functions. So what the quantum geometric tensor does is give a first-order estimate of the distance between the wave functions parameterized by W and W + delta W; it carries this sort of information. In any case, we can recast this equation, at least symbolically, and solve it. However, solving it means that we have to invert S. S is a matrix, and you can show that S is positive semi-definite: the eigenvalues of S_(k,k') are real and non-negative. However, it is also often highly singular, so it is not easy to invert. In practice, we usually try not to invert it explicitly and instead solve the linear problem with an iterative algorithm such as conjugate gradients, MINRES, and several others. In any case, if you can solve it and determine the delta W, you just feed it back into the equation we use to update the weights. So at every iteration, instead of computing just the gradient, you compute the gradient, you compute the quantum geometric tensor, you solve the linear system, and then you use the output to update your weights, and you repeat this over and over.
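To make this step concrete, here is a minimal sketch on a made-up 3-spin example (the ansatz and the energy gradient below are placeholders, not the lecture's code): it builds the real-parameter quantum geometric tensor S_(k,k') = <O_k O_k'> - <O_k><O_k'> from exact Born probabilities, regularizes its singular directions with a small diagonal shift, and solves the linear system with conjugate gradients instead of inverting S.

```python
import numpy as np
from scipy.sparse.linalg import cg

n, dim = 3, 8
sigmas = np.array([[1 - 2 * int(b) for b in f"{k:03b}"] for k in range(dim)])

w = np.array([0.3, -0.2, 0.1])
p = np.exp(sigmas @ w)                   # toy ansatz psi = exp(sum_i w_i s_i)
prob = p ** 2 / (p ** 2).sum()           # Born probabilities P(sigma)
O = sigmas.astype(float)                 # log-derivatives O_k(sigma) = s_k

# Quantum geometric tensor as a covariance of the log-derivatives
O_mean = prob @ O
S = (O * prob[:, None]).T @ O - np.outer(O_mean, O_mean)

grad = np.array([0.4, -0.1, 0.2])        # placeholder energy gradient
eta = 0.05
S_reg = S + 1e-4 * np.eye(n)             # diagonal shift against singular S
dw, info = cg(S_reg, -eta * grad)        # iterative solve, no explicit inverse

print(info == 0 and np.allclose(S_reg @ dw, -eta * grad, atol=1e-6))
```

In real calculations S is estimated from the same Monte Carlo samples as the gradient, and the diagonal shift (or a pseudo-inverse cutoff) is what keeps the solve stable when S is nearly singular.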
And this can be used to perform the real-time evolution of the system, or the imaginary-time evolution in order to find the ground state more efficiently, and so on and so forth. So with that, I conclude. Just to sum up a bit: I have shown you that we can use neural networks to variationally encode an arbitrary quantum state, or at least a physically relevant quantum state. We can estimate expectation values efficiently by doing Markov chain Monte Carlo sampling. We can also compute the gradients of those expectation values, and in particular of the energy, efficiently, so that we can optimize them. I have shown you that, thanks to the variational principle, we can recast both the problem of finding the ground state and the problem of doing the time evolution into a sort of optimization problem, and this allows us to solve them with iterative methods. And to end, I would like to point out that if you are interested in all of those things, we have a Python package that we are developing called NetKet. You can find it at the netket.org web address, where we implement most of those methods. There are several tutorials that teach you how to use it, and it is very easy: usually you just need to define your Hamiltonian, the variational ansatz, that is, the neural network you wish to use, and then the technique you want to use to optimize for the ground state or the time evolution, and this kind of thing. So yeah, with that I am done. I don't know how much time I have; I will try to answer some questions. You can address a few questions; there were a couple of them that arrived regarding the second part of the lecture, actually three of them now. Do you see the question and answer box? Yes, just give me a moment. So: you have mentioned gradient descent; can we use it to find the saddle points of an energy landscape? In general, stochastic gradient descent will go down.
So in general it will not stop at a saddle point, and it is not so easy to tell whether you are at a saddle point or not. Usually saddle points are something you want to avoid, because they slow down the optimization unless you use good optimizers; they are a problem. But in general, no: we are trying to find the ground state. I would also like to say that, yes, this energy functional we are optimizing is defined on the space of the variational parameters, but a saddle point in this space does not really have a physical meaning for the system you are describing. Next: is there a way to estimate whether the subspace W we consider contains, or at least is close enough to, the real ground state? Yes. I guess that by "subspace W" you mean the subspace of the Hilbert space that our variational ansatz is describing, not the variational manifold, which is just a tool we are using. First of all, the lower the energy we can reach, the better the approximation; this is already an indication, and it is very useful if we want to benchmark against other techniques. But when we go into unexplored realms, for example two- or three-dimensional systems where there are fewer reference results, there is something else we can do. I did not talk about it, but let me see. Yes: when we estimate E_loc, here I have the definition, if my wave function psi_W is exactly the ground state, then it is easy to prove that E_loc(sigma) for any possible input sigma takes the same value, the ground-state energy. So the variance of this distribution, and therefore the statistical error on your estimate, goes down to zero.
This is called the zero-variance principle, and it makes it quite easy to see that you have converged. This actually holds not only for the ground state but for any eigenstate of the Hamiltonian; but since in general we are looking for the ground state, unless there is something very pathological, if you hit this condition you know that you are really at the ground state. You can also use this technique, with some tricks based on symmetries, to actually target excited states. Next question: are there particular classes of Hamiltonians that can or cannot be treated with the neural quantum state technique? In general, there are Hamiltonians that are harder to train for, but we essentially have two tools. Given a Hamiltonian for which we try to find the ground state, we can first try to cook up a good neural network that should be able to represent the ground state. Of course, if I know that the ground state should respect some symmetries, or must represent fermions or bosons or whatever, I will change the architecture; I will not always use the same one. And this already changes the tool I am using to solve the problem. I am not aware of any particular forbidden Hamiltonian, but there are some Hamiltonians for which it is harder to solve the optimization problem, and it is also related to the ansatz. We are still trying to completely understand what makes the optimization procedure hard; it is still an open research question whether what prevents us from solving the optimization problem is the ansatz we are choosing or the Hamiltonian. Next: can we extend this to finite-temperature calculations of expectation values? Yes, indeed we can. We can also extend it to determine the steady states or the time evolution of an open quantum system, of dissipative systems.
It is not particularly hard, but in just 45 minutes it is hard to cover all the generalizations of this technique. Can you comment a little bit more on why the neural-network variational ansatz is better than other kinds of ansätze; is it related to non-linearity? I am not saying that neural networks are necessarily better than other ansätze; this is still a question we are researching. For example, for one-dimensional systems we know that matrix product states are extremely efficient and would be very hard to beat, and their optimization is also quite simple. For two-, three-, or four-dimensional systems, neural-network ansätze are very general and can work very well. And we can also take years of research done by the giants in machine learning, such as Google and IBM and several others, and exploit it, because we have interesting architectures that allow us to address some problems and actually encode some symmetries of the system into the architecture. In general, there are theorems telling us that neural networks are able to capture arbitrary correlations, even volume-law entanglement in a system, which is something that, for example, MPS in 2D cannot do. We are still trying to understand exactly the limits of those techniques, but there are theorems telling us that neural networks are arbitrarily good function approximators. So they perform very well, but I am not saying it is the ultimate technique. Okay, I think now it is maybe time to stop and take a break. I don't know if there are a couple more questions; I don't know if you want to answer them, but I think we should take a break now. So we are back at 1:45; in 10 minutes we start the lecture by Giuseppe. Is that fine? You can check the questions and maybe you can answer them directly. Yeah, okay.
Shall I just write down the answers? Yes, there is an answer box; I mean, I can keep answering, as you prefer. For me it is fine; I just don't know whether the other participants are fine with it. If you think you can answer them quickly, let's go, and then we break and stop afterwards. However, we would like to start at 1:45. Okay, so let's stop now. Okay, good. So we are back here in 10 minutes. Thanks everyone. See you soon.