Hello to everybody. This is the last lecture of this session of the school on Condensed Matter and Statistical Physics. For us, it's a pleasure to have here Juan Carrasquilla. Juan Carrasquilla is now a faculty member at the Vector Institute, and he is also an assistant professor at the University of Waterloo. He's working at the intersection of condensed matter, quantum computing, and machine learning; he's there in the middle. He's a former diploma student of the ICTP, a success story for our diploma programme, and also a former PhD student here at SISSA. From there he moved on to several postdocs in the US, and now he's at Waterloo. So please, Juan, whenever you want, you can start. OK, thank you, Alex, for the introduction. So it's a great pleasure to be back at ICTP, although this is an online talk; I'm always very happy to come. I want to tell you about recurrent neural networks and how we've been using them for many-body physics. The first part is going to be about basic notions, so let's get started. Here are the people funding my research, and my affiliations, the Vector Institute and the University of Waterloo. So what are recurrent neural networks, why are they useful, and why do we think they're a powerful tool to study many-body physics? Let me first tell you some generalities about what they are. They're a family of neural networks that are naturally suited for processing sequential data. That's how people talk about RNNs. You can process basically any data with them, but in principle they're naturally suited for data that comes in a sequence or in an order, right?
And so the sequence that I'm going to be using throughout my talk is a sequence of, say, spin variables, sigma 1, sigma 2, through sigma n. You can imagine them placed on a line, if you want, or you can think of this index i = 1, 2, ..., n as time, for instance. The idea is that recurrent neural networks can typically scale much better to longer sequences than would be practical for other neural network architectures that you may have seen in the last couple of weeks, like the restricted Boltzmann machine, which I'm sure you saw last week. So there's this specialization for sequences with the RNN. This is part of the conceptual idea that, by exploiting the structure of a problem, you can make problems computationally tractable, right? We're going to be using this idea a lot, and people in machine learning use it all the time: when you have a problem, you look at its structure, and you use that information about the problem to solve it. This is something that we in physics do all the time, we use symmetries and so on. This enriches our understanding of the problem, but it also saves us computational time, and people in machine learning have been using the same idea. So, most recurrent neural networks can process sequences of variable length. That's important. The sequence that I have here, the vector sigma, can have, say, either two spins, or n, or n plus 1, and this architecture can process all those variable-length sequences. That's important, for instance, in language translation and generally in natural language processing, which is one important task people in machine learning are interested in. So what do we do?
To go from a multilayer neural network, such as the restricted Boltzmann machine that you saw last week, I believe, what you do is you share parameters across different parts of the model, and I'm going to explain what this means in a little bit. This is also what people use in computer vision, in another famous architecture called the convolutional neural network, the CNN, which is right here. So there's this idea, very popular in machine learning, of sharing the parameters of a model as you process things over time or spatially. You share the parameters, and this makes the neural network more compact, but also more efficient at processing certain kinds of data. If we look at a so-called feedforward neural network, such as the restricted Boltzmann machine, what you have is something like this, where each input, say sigma 0, sigma 1, sigma 2, sigma 3, is processed with its own set of parameters. For instance, I have highlighted here sigma 0 and sigma 3. As the information about these variables is processed through the architecture, in this case vertically, you process sigma 0 with its own set of parameters, which I denote here in yellow, whereas sigma 3 is processed with a different set of parameters, here in purple. That's how you do it in a feedforward neural network, but in a recurrent neural network you do something a little bit different. You process sigma 0 with a parameter matrix W, for instance. I'm going to be explaining all of this in a little bit more detail later, but the important thing I want to highlight here is that you process sigma 0 with the parameters W, then you process sigma 1 and reuse the same parameters W, and so on and so forth for sigma 2 and sigma 3. So each input is processed through the same parameters, say U and W, where U are those here. I'm going to be explaining this architecture in detail.
These blocks here are the so-called recurrent cells, and I'm going to be giving you an example of those. This parameter sharing is what makes RNNs powerful and compact, and it also enables the variable length. So that's the conceptual leap from the more traditional feedforward neural network to the RNN with parameter sharing. Now I'm going to introduce recurrent neural networks as models for probability distributions. This is not the only way you can use them, you can use them in other ways, but this is what I'm going to focus on. So let's get started with the definitions and the math. Consider a probability distribution defined over a discrete sample space, where sigma is our collection of variables sigma 1, sigma 2, through sigma n, and each sigma can take d_v values, going from 0, 1, 2, up to d_v minus 1. That could be a spin configuration, for instance spin-1/2 Ising-like variables sigma, where you have 0 or 1, or plus or minus 1, but you can also do higher-dimensional local degrees of freedom. The probability of observing sigma, P(sigma), is just this quantity here, the joint probability of sigma 1, sigma 2, through sigma n. One important thing that I'm going to be using is the so-called chain rule of probability, which allows you to write down P(sigma) in terms of its conditionals. This is general, this is exact: it's P(sigma 1) times P(sigma 2 | sigma 1) and so on, all the way up to P(sigma n | sigma 1, sigma 2, ..., sigma n minus 1). This is important, and it is the basis of how we build the RNN, OK? I'm going to introduce a shorthand notation for the conditional of sigma i given sigma 1 through sigma i minus 1: sigma i conditioned on sigma less than i, OK? This is a shorthand for the conditionals.
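As a minimal numerical illustration of the chain rule, here is a toy two-spin joint distribution (the numbers are made up, not from the talk), checking directly that P(sigma 1, sigma 2) = P(sigma 1) P(sigma 2 | sigma 1):

```python
import numpy as np

# Toy joint distribution over two binary spins (hypothetical numbers):
# p[s1, s2] = P(sigma_1 = s1, sigma_2 = s2); entries sum to 1.
p = np.array([[0.1, 0.3],
              [0.2, 0.4]])

# Chain rule: P(s1, s2) = P(s1) * P(s2 | s1)
p1 = p.sum(axis=1)            # marginal P(sigma_1)
p2_given_1 = p / p1[:, None]  # conditional table P(sigma_2 | sigma_1)

# Rebuilding the joint from the two factors recovers it exactly.
reconstructed = p1[:, None] * p2_given_1
assert np.allclose(reconstructed, p)
```

For n spins the same telescoping works, but the conditional tables grow exponentially with n, which is exactly the explosion discussed next.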
Here there's one important observation: if you specify every conditional in the chain rule, this gives a full description of any possible distribution of this type, OK? However, this characterization is very expensive. In general, either directly writing all these probabilities or writing all the conditionals is a representation that grows exponentially with the system size n, meaning that if you specify all the possible values for the conditional distributions, this is a big table with an exponential number of entries. So there's no hope we can use it for practical purposes, right? And the question is, can we alleviate this explosion? The idea is, yes, we will, and we will exploit the structure of the problem to alleviate this complexity, which is something that we in physics do pretty much all the time. For instance, in tensor networks, you have these exponentially complicated wave functions, and then you write an approximation in terms of, say, matrix product states, which alleviates that complexity. What we do here is similar: we're trying to alleviate this exponential complexity by exploiting some structure of the problem. The idea is that nature is benevolent and real-world problems have enough structure that we can use it, so that we can use far fewer resources to tackle these problems. And what RNNs do is exactly that: they parametrize this probability distribution P(sigma) entirely through the conditionals. Basically, we are parametrizing each of the conditionals in the chain rule here, OK? That's what the RNN does here. RNNs do many things, but this is the way I'm going to be explaining them.
The elementary building block of an RNN is a so-called recurrent cell, which helps us specify the conditionals, OK? The simplest recurrent neural network, which people call the vanilla RNN because it's the simplest, is just a non-linear function that maps the concatenation of a d_H-dimensional hidden vector with an input vector sigma that represents the spin configuration, this variable up or down, or one or two. You map this concatenation through a non-linearity and you end up with a new vector h_n, which is this here, also d_H-dimensional. That's all it is: a non-linear function applied to a linear, or affine, transformation of this state vector, or hidden vector, concatenated with the input, OK? And f is a non-linear activation function, typically a sigmoid function. So what are the parameters of the RNN? There's a weight matrix W, usually real, though you can also use complex numbers, with dimensions d_H times (d_H plus d_v), where d_H is the dimension of the hidden vector and d_v is the dimension of the input, OK? And then there's a bias vector b, which lives in R^{d_H}, the dimension of the hidden state. We're going to be using this expression as a recurrence: you go from a hidden vector h_{n-1} to h_n. This is a recurrence relation, so we have to initialize h_0 and sigma_0 to something to get the calculation going, and we initialize them to zeros, but any constant vectors work. And then something I didn't specify: I have this sigma in bold, I don't know if it was easy to notice, but it's just a vector that represents each spin configuration. These are so-called one-hot encodings of the input.
For instance, if you have some input that is either one or zero, then I'm going to replace that by a tiny vector with dimension two, where zero corresponds to (1, 0) and one corresponds to (0, 1). That's the one-hot encoding, for convenience. So how do we proceed? We proceed by computing all these conditionals from the chain rule through this expression. Each conditional of sigma n given sigma 1 through sigma n minus 1 is a dot product of y_n with sigma_n, this one-hot encoding, where the dot is just a scalar product and this y_n is yet again an extra layer in the RNN. I'm going to show what this looks like graphically to make things a little bit more fun. It's a so-called softmax layer, where you take this recurrent vector h_n, you apply an affine transformation, and then you apply this non-linearity called softmax. The softmax activation function is given by the exponential of the components of whatever you put there, divided by the sum over the components so that it is normalized. Juan, there is a question related to the previous slide; I think we can stop there. Can you elaborate on the condition on the dimension of the weight matrix? Why is it that specific? Let me see, so on the dimension of the weight matrix. We have this weight matrix here, and what it does, basically, is it contains the parameters of the model, and its dimensions are related to the power of the model, right? The expressive power of the model is encoded in the dimensionality of the hidden vector h, which is this d_H here, OK? And this is tied to the size of the matrix. So the dimensionality of this weight matrix is this one here, and it's tied to the expressive power of the model.
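To make the cell concrete, here is a minimal NumPy sketch of one step of a vanilla RNN followed by the softmax layer. The variable names (W, b, U, c) and the tanh non-linearity are illustrative choices; the talk mentions a sigmoid, and any standard non-linearity plays the same role.

```python
import numpy as np

def softmax(x):
    # exponentials of the components, normalized to sum to one
    e = np.exp(x - x.max())
    return e / e.sum()

d_h, d_v = 4, 2  # hidden dimension d_H and input (one-hot) dimension d_v
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d_h, d_h + d_v))  # weights: d_H x (d_H + d_v)
b = np.zeros(d_h)                                 # bias vector in R^{d_H}
U = rng.normal(scale=0.1, size=(d_v, d_h))        # softmax-layer weights (illustrative)
c = np.zeros(d_v)                                 # softmax-layer bias

def rnn_cell(h_prev, x_prev):
    # vanilla cell: h_n = f(W [h_{n-1}; sigma_{n-1}] + b), with f = tanh here
    return np.tanh(W @ np.concatenate([h_prev, x_prev]) + b)

h0, sigma0 = np.zeros(d_h), np.zeros(d_v)  # constant (zero) initialization
h1 = rnn_cell(h0, sigma0)
y1 = softmax(U @ h1 + c)  # y_1: the conditional distribution over sigma_1
```

Every component of y1 is positive and the components sum to one, so the dot product of y_1 with a one-hot encoding really is a probability.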
I don't know if that answers the question; maybe we can allow Rajiv to be more specific. Hello. Hello. Hi, Juan, you answered my question, thanks a lot. Yes, I got my answer. OK, great. And then there's a question by Shionee: I was wondering how the fact that the hidden layers are connected enters the activation function. Let me see. So the hidden layers enter the activation function through this expression here. The hidden state of the model is h, right? And you apply this non-linearity to the concatenation of h and the input, and that's how you process information in this architecture. So yes, the hidden state is connected to the non-linearity through this recurrence relation. OK, I hope that answers the question. Thank you all for the questions. So I was here, at this extra layer of the recurrent neural network, which is the softmax. The softmax ultimately provides the conditional probability, right? Because, as long as U and c and everything are real, this softmax layer gives you a probability distribution that we interpret as a conditional, OK? That's how you get this conditional distribution. And then, once you have computed all these conditionals sequentially, you can compute the probability of the entire configuration sigma by multiplying. The chain rule had all those multiplications, right? All these conditionals come multiplied, and that corresponds to this multiplication here. So there are a few important things that I want to highlight that I think are very powerful about this model. One of them is the fact that P is normalized, OK?
As opposed to, for instance, energy-based models, or probabilistic models based on an Ising Hamiltonian, where you need to compute the partition function if you want to get, say, P(sigma), which is very challenging for some of these models. In this case there's no issue; the model is normalized, which I think is a very powerful thing. And second, sampling the probability distribution is achieved in a sequential fashion. You sample each conditional sequentially, and at the end of that process, which you do, say, n times, you are guaranteed to have an exact sample of the model, OK? So there's no need to perform Markov chain Monte Carlo simulations to get samples from this model, which again is very powerful, I think. So I wanted to highlight those two. Here is a graphical representation of the RNN. Here's the input; you send the input to the cell, which is this expression here that I was referring to. You compute this recurrence, and then you use this vector h_n to compute the conditional using the softmax. At the end, you get this nice parameterization of the conditional. This is a graphical representation of the fact that this cell is used over and over: you take this h_{n-1}, you send it to the recurrent expression, and then you get h_n. This is the most compact representation, but there's also an unrolled version of this cartoon: you initialize at h_0 and sigma_0, you do this little calculation in terms of the recurrence relation and the softmax, which gives you the conditionals. You get P(sigma 1), then you use sigma 1 to compute P(sigma 2) conditioned on sigma 1, and so on. So this is an unrolled version of the recurrent neural network, where each box is this expression with the corresponding hidden vectors h_{n-1} and h_n and the parameters.
And this is how you compute the distribution. For instance, if you're computing the first term in this expression, you take this y_1, the output of the first step of the RNN, and you multiply it by sigma 1, and so on. You keep doing this until you get the value of the probability of some configuration. Maybe there are three questions; maybe I can answer them right now. So the question, by Joseph, is: I was wondering if the activation function relates to the partition function, and why. I think it does, right? It relates to the partition function in the sense that when you compute the partition function, it's a sum over all the possible values of sigma, and that gives you one by construction. So, OK, the model is normalized by construction, so the partition function is not even there; but if there was a partition function, you could write it in terms of the activation functions. That's what I think, but the model is normalized, so there's no partition function per se. The second question is: do you apply any cutoff on the conditional probability? OK, this is a good question. Do you apply any cutoff on the conditional probability, say P(sigma n) equal to P(sigma n) conditioned only on a few of the variables? The answer is, in principle, no, because we never apply this so-called Markov condition; we don't use it in the definition of the model. But in practice, as you process information, this vector h that is passed between the different steps in the unrolling of the RNN can only carry a certain amount of information. This is why this vector is called a memory vector.
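The point about the partition function can also be checked numerically: for a small chain you can brute-force the sum of the RNN probability over every configuration and verify that it equals one by construction. This is a sketch with hypothetical parameter shapes and randomly chosen weights, summing over all d_v^n configurations:

```python
import itertools
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnn_probability(sigmas, W, b, U, c):
    """Chain-rule product: P(sigma) = prod_n y_n . onehot(sigma_n)."""
    d_h, d_v = b.size, c.size
    h, x = np.zeros(d_h), np.zeros(d_v)  # h_0 and sigma_0 initialized to zeros
    prob = 1.0
    for s in sigmas:
        h = np.tanh(W @ np.concatenate([h, x]) + b)  # recurrent cell
        y = softmax(U @ h + c)                       # conditional P(sigma_n | sigma_<n)
        prob *= y[s]
        x = np.eye(d_v)[s]                           # feed back the one-hot of sigma_n
    return prob

rng = np.random.default_rng(1)
d_h, d_v, n = 3, 2, 4
W = rng.normal(scale=0.1, size=(d_h, d_h + d_v))
b, c = np.zeros(d_h), np.zeros(d_v)
U = rng.normal(scale=0.1, size=(d_v, d_h))

# Sum over all d_v^n configurations: normalized, no partition function needed.
total = sum(rnn_probability(cfg, W, b, U, c)
            for cfg in itertools.product(range(d_v), repeat=n))
assert abs(total - 1.0) < 1e-10
```

The sum telescopes to one because each conditional is itself normalized by the softmax.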
So in practice, what happens is that if the dimension d_H of the hidden vector is not too big, then the correlation between one variable, say sigma 1, and some other variable sigma i becomes weaker and weaker as they get further apart. For most architectures, I think it actually decays exponentially. So it depends on how powerful the model is, but in practice there's some effective cutoff; it's just that this cutoff is not imposed by hand using one of these Markov conditions that Mahmoud is talking about. That's a good question. Maybe I can answer one more. The statement that the samples aren't correlated is really powerful; how does this relate to numerical precision when calculating an observable, in comparison with Monte Carlo methods? This is a good question. The idea is that when you do Markov chain Monte Carlo, there is some autocorrelation time that you have to account for, such that when you compute expectation values using the samples out of Markov chain Monte Carlo, they are correct, right? If you don't account for that and you use correlated samples, then you start introducing a bias in the expectation value of your quantity of interest. What I'm saying is that here this autocorrelation is zero, basically. So you know that, given the architecture and the model you have, these expectation values are unbiased, right? Whereas in Monte Carlo, in general, they can be biased, because guaranteeing that you're not using correlated samples is difficult. In practice, what we do is some sort of binning analysis to make sure that the samples are really uncorrelated; what I'm saying is that you just don't have to worry about this type of analysis anymore, because the samples are uncorrelated. So that's a good question.
There's also one question on the conditions under which this becomes unstable, since a non-linearity is involved. I'm not sure I understand the question; can you speak, Shree Harisham? Yes, here in this scenario you are using the sigmoid function, right? Which is a non-linear function. So under which conditions does this become unstable? If you are considering dynamical systems, there will be some notion of stability; at what point in time is this system stable, and at what point does it become unstable? What is the condition? So I don't have a good answer to that question, but I understand it. Does that make sense or not? I'm not sure, but I just wanted to give it a try, whether it is correct or wrong. No, it is correct. So I don't know of examples where this happens, but I understand where you're coming from: this can be understood as a dynamical system; it is a map, right? And what typically happens is that you flow towards some fixed points. That's what happens in my experience: you always flow to one fixed point or another. So I've never seen instabilities; what I see is some form of saturation in these models when understood as dynamical systems. OK, so this area still needs to be explored, right? I think there's some work along these lines, on the dynamical properties of RNNs, but I'm not very familiar with it. In practice I've never seen problems, because everything is properly normalized; there's no explosion of anything here. OK, and will this be periodic or non-periodic? I don't think it's periodic, as far as I know. Thank you, thank you. Yeah, no problem. There's also a question: can sigma be a matrix? I think so, it can be a matrix.
You can indeed reshape; anything can be a matrix in some sense, but yes, it can be a matrix. All right, let me go ahead, as I'm running out of time. So this is how we compute probabilities. How you sample is a little bit similar. What you do is you initialize h_0 and sigma_0, you compute the first conditional P(sigma 1), and then you sample it. You can use a random number generator; it's just sampling either up or down given the probabilities. You sample, you get sigma 1, the first sample; you bring it here, you input sigma 1, and you compute P(sigma 2) conditioned on the sigma 1 that you observed. You sample again, this is again two numbers, the probability of up and of down if this is a two-level system; you get sigma 2, you store it, you bring it here, you sample P(sigma 3) conditioned on sigma 2 and sigma 1. You get sigma 3, you bring it in here, and you keep repeating this; this is how you get a sample. So it's very easy, and it gives you exact samples, as I was saying. So that's it for probabilities. Let me quickly go through how you can extend this to quantum many-body systems, OK? Which is the next part, RNN wave functions. Let me first tell you about an important class of so-called stochastic many-body Hamiltonians. They have ground states with amplitudes ⟨sigma|psi⟩ that are strictly real and positive in the standard computational basis sigma. So now I'm promoting sigma to a quantum basis, an orthonormal basis set, and now we have this ket. You can write the ground state of a stochastic Hamiltonian in this form, and you can interpret its amplitudes as the square root of a probability distribution, when you restrict to ground states of Hamiltonians that are so-called stochastic Hamiltonians.
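The sampling loop just described can be sketched as follows (the parameter names are hypothetical, as before; the point is only the autoregressive structure: sample a spin from each conditional and feed it back in):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_rnn(n, W, b, U, c, rng):
    """Draw one exact sample sigma_1..sigma_n; no Markov chain, no autocorrelation."""
    d_h, d_v = b.size, c.size
    h, x = np.zeros(d_h), np.zeros(d_v)  # h_0 and sigma_0 initialized to zeros
    sample = []
    for _ in range(n):
        h = np.tanh(W @ np.concatenate([h, x]) + b)
        y = softmax(U @ h + c)        # conditional given the spins sampled so far
        s = rng.choice(d_v, p=y)      # "flip the (biased) coin"
        sample.append(int(s))
        x = np.eye(d_v)[s]            # feed the outcome back in as a one-hot vector
    return sample

rng = np.random.default_rng(2)
d_h, d_v = 3, 2
W = rng.normal(scale=0.1, size=(d_h, d_h + d_v))
b, c = np.zeros(d_h), np.zeros(d_v)
U = rng.normal(scale=0.1, size=(d_v, d_h))
cfg = sample_rnn(5, W, b, U, c, rng)  # one exact sample of 5 spins
```

Each call returns an independent exact sample, which is the property contrasted with Markov chain Monte Carlo above.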
So it is natural to use the recurrent neural network to represent this wave function, right? What you're doing is building a coherent superposition out of the RNN, and that's our RNN wave function, which we explored in this paper with my graduate student, Mohamed Hibat-Allah. However, wave functions are complex in general, and so we need a phase, and there is a simple way to introduce one. Let me tell you how we did it. This is the usual RNN that I explained a little bit earlier, where we just make a coherent superposition with the square roots of the probabilities. So you can define a wave function through the usual RNN. However, if you want a complex-valued wave function, you can add an extra layer on top of the softmax. Here we have the softmax layer, which is what we had originally; now we also have a so-called soft-sign layer that computes a phase for every conditional, OK? That's what we've done here. It just adds a few more parameters to the RNN, and then you use those parameters to estimate phases phi 1, phi 2, phi 3, all the way through phi n. Let me give you the details. The soft-sign layer is basically this expression here: you multiply by pi this soft-sign function, which is x divided by one plus the absolute value of x. This is between minus one and one, so if you multiply it by pi, it looks like a phase. And then the overall phase of the wave function is just the sum of all the phases at each site, or at each spin, OK? So now, in the last few minutes I have: that's all about the architectures and the basics of the RNN, but let me tell you how to train these models. I'm going to give you, hopefully, two or three examples of how you can train them.
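As a rough sketch of the phase layer (the parameter names U_phi and c_phi, the stand-in hidden vectors, and the placeholder probability are all hypothetical, just to show the arithmetic): each site contributes phi_n = pi * softsign(...) selected by the one-hot spin, and the wave function gains exp(i * sum of phi_n) on top of the square-root amplitude.

```python
import numpy as np

def softsign(x):
    # x / (1 + |x|), strictly between -1 and 1; times pi it acts as a phase
    return x / (1.0 + np.abs(x))

rng = np.random.default_rng(3)
d_h, d_v, n = 3, 2, 4
U_phi = rng.normal(size=(d_v, d_h))  # extra phase-layer weights (hypothetical names)
c_phi = np.zeros(d_v)                # extra phase-layer bias

hs = [rng.normal(size=d_h) for _ in range(n)]  # stand-in hidden vectors h_1..h_n
spins = [0, 1, 1, 0]                           # some configuration sigma

# phi_n = pi * softsign(U_phi h_n + c_phi) selected at component sigma_n
phis = [np.pi * softsign(U_phi @ h + c_phi)[s] for h, s in zip(hs, spins)]
total_phase = sum(phis)

p_sigma = 0.1  # placeholder for the RNN probability P(sigma) of this configuration
amplitude = np.sqrt(p_sigma) * np.exp(1j * total_phase)  # psi = sqrt(P) e^{i phase}
```

The modulus of the amplitude is untouched by the phase layer, so the Born probabilities |psi|^2 still come from the normalized RNN.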
In machine learning, the most traditional way people do it is you take a big bunch of data from a probability distribution you observe in nature, such as images on the internet or a collection of words in a book, and then you use so-called maximum likelihood estimation, the maximum likelihood principle. To estimate the parameters of the RNN, we can use this principle. The principle is very simple, but it's also a very deep idea: the parameters of the statistical model, which is our recurrent neural network, are selected by assigning high probability to the data you observe. This makes sense, because imagine you're sampling some probability distribution, for instance the probability distribution of natural images, the images you take with a camera. What you get out of the camera has a high probability of occurring, because it happened, right, in nature. That's the idea: when you sample experimentally from a probability distribution, the observations that you see have high probability, and so, if we're fitting that data with a probabilistic model, we should assign them high probability. That's the principle of maximum likelihood estimation. So you're given a data set, sigma hat, which is a collection of data points sigma^(n), and what you do is you compute the probability of that data under the model, and that's called the likelihood, OK? You assume that the samples in the data set are uncorrelated, and so the probability of observing the data set is the product of the probabilities assigned by the model P_theta, where P_theta is going to be the RNN and the parameters of the RNN are encapsulated in this variable theta, OK? So that's the idea: you can maximize this quantity with respect to theta.
However, since probabilities are between zero and one, if you multiply too many of them, this likelihood function is eventually going to be very small, OK? So what you do is you say: instead of using the likelihood function, I'm going to define, for numerical convenience, the negative logarithm of the likelihood. You simply take minus the log, and then you end up with this expression here, which is much better to handle numerically. Instead of maximizing the likelihood, because you put a minus sign here, you minimize this negative log-likelihood, OK? And what you do is you use gradient descent techniques, which you saw, I think, last week. So what are the ingredients of this minimization using gradients? First, you have to compute the gradient, with respect to theta, of this negative log-likelihood. To do that, you use an algorithm called backpropagation through time, which is basically applying the chain rule to the function defined by the RNN, OK? I have the slides at the end, but it's very clear to me that I won't have time, so let's leave it at that; I encourage you to take a look at this link. It has a derivation of this backpropagation, so basically the derivation of the gradients of the RNN with respect to its parameters. This is by a colleague of mine, Roger Grosse, and it's very, very clear. Then what you do is a parameter update following the direction of steepest descent: you replace theta by theta minus some small parameter alpha times the gradient, and you iterate this until you get convergence, OK? The hard part is using these derivatives in the optimization, because they can explode or vanish; this is related to the earlier question about stability. There's no problem in the calculation of the probabilities, but the gradients can become very large or very small.
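To see the negative log-likelihood plus steepest-descent machinery in the smallest possible setting, here is a toy one-parameter model (a single Bernoulli spin, not the RNN) trained on a hypothetical data set, with a numerical gradient standing in for backpropagation through time:

```python
import numpy as np

def model_prob(sigma, theta):
    # toy model: P(1) = sigmoid(theta), P(0) = 1 - P(1)
    p_up = 1.0 / (1.0 + np.exp(-theta))
    return p_up if sigma == 1 else 1.0 - p_up

data = [1, 1, 0, 1]  # hypothetical observed spins

def nll(theta):
    # negative log-likelihood: -sum_n log P_theta(sigma^(n))
    return -sum(np.log(model_prob(s, theta)) for s in data)

theta, alpha, eps = 0.0, 0.1, 1e-6
for _ in range(300):
    grad = (nll(theta + eps) - nll(theta - eps)) / (2 * eps)  # stand-in for BPTT
    theta -= alpha * grad                                     # steepest-descent update

# Maximum likelihood reproduces the empirical frequency: P(1) -> 3/4.
p1 = 1.0 / (1.0 + np.exp(-theta))
assert abs(p1 - 0.75) < 1e-3
```

The converged model assigns probability 3/4 to "up", matching the fraction of up spins in the data, which is exactly the "assign high probability to what you observed" principle.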
This is a consequence of understanding the system as a dynamical one. For the vanishing or exploding of the gradients there are some issues, and so you need to modify the architecture so that this is improved. And then, finally, there's a different method of training that we're going to be using. In physics, we are often given a Hamiltonian, right? We're not given data, but we often have a Hamiltonian, and so we can use variational Monte Carlo to optimize this ansatz with respect to some local Hamiltonian. Ah, Juan, maybe before you continue, because there are questions related to the learning part, with maximum likelihood. Oh yeah, yeah, let's... Maybe before I go on with the questions: how much time do I have? Is it already the break now, or in a few minutes? No, not necessarily; I think you still have time. OK, great, so I'm going to have time, awesome. Let me go ahead with the questions. Lavi Kumar is asking: as you mentioned, an RNN is concerned with sequential data; can we determine the time dynamics of a system, for instance a wave function sequence in time? So I'm not sure exactly what the task is here, but yes, you can determine the time dynamics of some systems, and there has been exploration along those lines, of using RNNs to study real-time dynamics. Let me go ahead with a different question. Sandip is asking: during sampling, is it necessary to take account of fluctuations, because sigma 1 is fed into the next step, sigma 2, and, as you said, there is random probability sampling. So yes, when you're doing sampling, you account for the probabilistic nature of what you're trying to do, which is sampling. You output, say, P(sigma 1) in the first step, and then you do a sampling step: you flip a coin or you roll a die.
And then depending on the outcome, you get either up or down. You take that and bring it back to the RNN to compute the conditional, P of sigma two conditioned on that outcome, okay? And so if you repeat this multiple times, sometimes you'll see up and sometimes down, and so you will have accounted for these fluctuations that Sandip is mentioning. Is there some stochasticity in gradient descent? In principle, depending on your cost function and on the training style, there is going to be stochasticity. If your data set is very large, then instead of taking the entire data set to compute this log likelihood, you would take only a fraction of it, a mini-batch, and use that to estimate the gradient. As you cycle through the different pieces of data, there are fluctuations due to the fact that you're not using the entire data set. But if you use the entire data set, then there's no stochasticity in maximum likelihood estimation. If you're doing variational Monte Carlo, however, there is always going to be stochasticity, because you cannot sum over the entire Hilbert space, and so you will have stochasticity in the gradient. That's a good question, thanks. Valentin: will it work for any kind of Hamiltonian, or only local Hamiltonians? So for variational Monte Carlo, this is meant only for local Hamiltonians. If you have a non-local Hamiltonian, then there's an intractability in the calculation of the gradients and of the expectation value of the energy. So yes, it's local. And then the second question: how does the dimension of h relate to physical reasoning? So the dimension of the RNN hidden state is a so-called hyperparameter; it's a parameter that you fix by hand, and what people do in practice is the higher, the better, because it makes the model more and more powerful, okay?
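The sampling loop just described can be sketched with a toy numpy RNN (a minimal illustration of the idea, not the lecture's actual architecture; all weights here are random and made up). Each step produces a conditional probability, a spin is sampled from it, and that spin is fed back in to produce the next conditional:

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 3, 4                        # sequence length, hidden-state dimension

# Toy RNN weights (randomly initialized, untrained).
W = rng.normal(scale=0.5, size=(H, H))   # hidden -> hidden
V = rng.normal(scale=0.5, size=H)        # previous spin -> hidden
b = rng.normal(scale=0.5, size=H)        # hidden bias
u = rng.normal(scale=0.5, size=H)        # hidden -> output logit
c = 0.1                                  # output bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conditionals(spins):
    """P(sigma_i = 1 | sigma_1 .. sigma_{i-1}) for a given configuration."""
    h, prev, probs = np.zeros(H), 0.0, []
    for i in range(N):
        h = np.tanh(W @ h + V * prev + b)      # recurrence
        probs.append(sigmoid(u @ h + c))       # conditional for site i
        prev = spins[i]                        # feed the spin back in
    return np.array(probs)

def sample():
    """Draw one configuration autoregressively, with its probability."""
    h, prev, spins, logp = np.zeros(H), 0.0, [], 0.0
    for _ in range(N):
        h = np.tanh(W @ h + V * prev + b)
        p = sigmoid(u @ h + c)
        s = float(rng.random() < p)            # "flip the coin"
        logp += np.log(p if s == 1.0 else 1.0 - p)
        spins.append(s)
        prev = s
    return np.array(spins), np.exp(logp)
```

Because every conditional is normalized at each step, the joint probabilities of all 2^N configurations sum to one, which is what makes such models exact samplers, with no Markov-chain equilibration needed.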
And the physical reasoning is that this h is, if you want, the mechanism through which you express correlations in a probability distribution or in a wave function. If you have a physical system with strong correlations, then you want to make this dimension high, whereas if you don't have any correlations, for instance in a mean-field theory, then this dimension can be zero and the wave function becomes a mean-field approximation. That's kind of how I reason about it: if you make it zero, then this probability distribution is a product distribution, or, in the wave function case, a mean-field theory; and as you make the dimension higher and higher, you're capable of accounting for stronger and stronger correlations. Mahmoud is asking: sorry, I did not understand what approximation makes the RNN behave linearly with data size. I don't understand the question; I don't think I made the statement that they behave linearly with data size, but if you want to clarify it, you're welcome to, Mahmoud. Hi. In the previous lectures, by reducing the size of the Hilbert space through symmetry constraints or other locality constraints, we decreased the size of the data we need to run our calculations. I don't know what approximation of that type can be used here. My previous question in this course, about the cutoff you apply on the conditional probability, was in this respect, and I don't understand the point here. Thank you. Yeah, so we don't apply cutoffs. The approximation that we make is that this recurrent network representation cannot represent arbitrary conditionals unless you make the dimension of the hidden vector exponentially large, which we want to avoid, okay? If you made the dimension of the hidden vector very large, then you could in principle represent any probability distribution P of sigma.
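To make the zero-hidden-dimension limit concrete (my own toy illustration, not from the lecture): if the hidden state carries no information, each conditional collapses to a fixed number that ignores the history, and the joint distribution factorizes into a product, i.e. a mean-field ansatz:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# With no hidden state, each site i just has its own bias c_i:
# P(sigma_i = 1 | history) = sigmoid(c_i), independent of the history.
c = np.array([0.3, -1.2, 0.7])
p = sigmoid(c)

def joint_prob(spins):
    """P(sigma) built from 'conditionals' that ignore the past: a product."""
    return np.prod(np.where(spins == 1, p, 1.0 - p))

# The distribution factorizes, so connected correlations vanish:
# <s_i s_j> - <s_i><s_j> = 0 for i != j (spins s in {0, 1}).
print(joint_prob(np.array([1, 0, 1])))  # = p[0] * (1 - p[1]) * p[2]
```

Growing the hidden dimension away from zero is precisely what lets the conditionals depend on the history and build up correlations.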
So the approximation is that by restricting the amount of correlation, that is, the size of the hidden state, you impose restrictions on the type of conditionals you can represent. That's what makes the RNN cheap, but also less powerful than a generic approach where you tabulate every conditional, which is exponentially big. There's a question by Tim: for a sequence sigma one through sigma n, do we assume the Markov property? I think I answered that earlier: there's no Markov approximation. There is an effective Markov-like approximation that comes from putting a finite number in the dimension of the hidden state, which limits the distance over which you can have correlations, but in principle you make no Markov approximation. I think this is a very important question, by the way. There's no Markov approximation; effectively something like that happens, but it happens naturally, because you can account for very long-distance correlations. There's a question by Lavi: just to understand better, the RNN approach you showed here is for the sequential data of the system, for instance calculating the free energy as the temperature decreases, and then we determine the ground state, is that right? That is, it's another way to determine the ground state. So it depends: are you talking about the ground state of a classical Hamiltonian or of a quantum Hamiltonian? In both cases you can do this, okay? You can find ground states, that is, zero temperature, of quantum many-body systems, but you can also use these approaches to compute the ground state of a classical Hamiltonian, which is an example I have prepared for the second part of the lecture. Okay, so I won't go through too many details of RNNs with variational Monte Carlo; I'm going to assume you have seen this, and I think you did last week. You can use variational Monte Carlo to estimate the ground states of many-body Hamiltonians.
And you do that by simply computing the energy and its gradients, okay? The gradients have this form; it's not too important. We're just going to use the same technique: gradient descent on the expectation value of the energy of a local Hamiltonian. And we use the simplest parameter update; there are more complicated update rules, like stochastic reconfiguration, which I think Lipovic and Dini explained last week. But let me conclude with a beautiful example, related to a question from before, of whether you can get ground states of classical Hamiltonians and use this for statistical mechanics problems. There's a way to do it, described in a very beautiful paper by Wu, Wang, and Zhang in PRL about two years ago: you can apply a variational approach to statistical mechanics, okay? What this is: you have a model distribution p_theta that approximates the Boltzmann distribution at a finite temperature T. You may wonder why you would want to do this, but bear with me. What you do is basically try to make this probability distribution p_theta match the Boltzmann distribution, and you can do this by writing down the free energy of the distribution p_theta and minimizing it with respect to theta, okay? What is this free energy? It's the expectation value of a classical Hamiltonian over this distribution p_theta, minus T times the entropy of the distribution, where H is a classical Hamiltonian and the entropy, I'm afraid I didn't write the expression, is basically minus the sum over spin configurations sigma of P times log P. It turns out that if you use a recurrent neural network, or any autoregressive model, or any normalized model such as the RNN, both the energy and the entropy are very easy to compute, okay?
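Putting the pieces just described into symbols (my notation, not copied verbatim from the slides), the variational free energy being minimized is:

```latex
S(p_\theta) = -\sum_{\sigma} p_\theta(\sigma)\,\ln p_\theta(\sigma), \qquad
F_\theta = \langle H \rangle_{p_\theta} - T\,S(p_\theta)
         = \mathop{\mathbb{E}}_{\sigma \sim p_\theta}
           \big[\, H(\sigma) + T \ln p_\theta(\sigma) \,\big].
```

The bracketed quantity, $H(\sigma) + T \ln p_\theta(\sigma)$, is the "local free energy" whose sample average appears in the next step, and it is computable precisely because the autoregressive model gives $\ln p_\theta(\sigma)$ exactly, with no partition function.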
So then you can use the same strategy to approximate statistical mechanics problems, okay? This is how you do it. The free energy can be estimated from samples, and I explained how to obtain these samples from the RNN: you take Ns samples and compute the average of the so-called local free energy, f_loc. (I have the word "target" here because of something I'm going to talk about.) This f_loc of sigma is basically H of sigma, the energy of the classical configuration, plus T times the log of the probability P of sigma. And as I said, I explained how to compute P; it's just this sequential process where you compute the probability the RNN assigns to a configuration, so this is actually easy to compute. Both the free energy and its gradients are easy to compute, okay? It turns out you can compute the gradients of the free energy using samples from the RNN: you take Ns samples and compute the gradient with respect to theta of the log of P on the samples that you drew, times this local free energy. The thing I haven't explained yet is how you get these gradients, which is again the backpropagation through time for which I referred you to Roger Grosse's lecture notes. In practice we use automatic differentiation, which is a very powerful technique that allows you to compute the gradient of any differentiable program you write; it's so easy to use that you basically don't have to do anything. Then you use a simple parameter update, theta equals theta minus a small constant times the gradient of the free energy, and you iterate this until convergence. And this is how you approximate solutions to statistical mechanics problems. I think that's it for now; let's see if there are more questions. There are no more questions, so I guess it's time for a break now, Alejandro? Alex? Well, we can have a break now, for some minutes as you need.
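Here is a minimal sketch of this variational scheme for a tiny 1D classical Ising chain. To keep it a few lines, I use a mean-field (zero-hidden-dimension) product model in place of a full RNN; the score-function gradient estimator below is the standard one for such normalized models, and all parameter names and values are my own choices, not the lecture's:

```python
import numpy as np

rng = np.random.default_rng(1)
N, J, T = 4, 1.0, 1.0              # number of spins, coupling, temperature
Ns, alpha, steps = 200, 0.2, 600   # samples per step, learning rate, iterations

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(s):
    # Open 1D Ising chain: H(sigma) = -J sum_i s_i s_{i+1}, spins in {-1, +1}.
    return -J * np.sum(s[:, :-1] * s[:, 1:], axis=1)

theta = rng.normal(scale=0.1, size=N)  # logits of a product (mean-field) model

for _ in range(steps):
    p_up = sigmoid(theta)
    bits = (rng.random((Ns, N)) < p_up).astype(float)   # exact samples
    s = 2.0 * bits - 1.0
    log_p = np.sum(bits * np.log(p_up) + (1 - bits) * np.log(1 - p_up), axis=1)
    f_loc = energy(s) + T * log_p                       # local free energy
    # Gradient estimator: E[(f_loc - baseline) * grad_theta log p(sigma)]
    grad_log_p = bits - p_up                            # per sample, per site
    grad = np.mean((f_loc - f_loc.mean())[:, None] * grad_log_p, axis=0)
    theta -= alpha * grad                               # steepest descent step

F_est = f_loc.mean()                                    # variational free energy
F_exact = -T * np.log(2.0 * (2.0 * np.cosh(J / T)) ** (N - 1))  # open chain
print(F_est, F_exact)  # F_est upper-bounds F_exact, up to sampling noise
```

Because the product ansatz cannot express the chain's correlations (the hidden dimension is zero, as discussed above), the converged F_est sits strictly above the exact free energy; swapping in an RNN for the sampler and log-probability closes that gap.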
I can go on; it's up to you. Well, I think let's have a break. Yeah, let's do a small break. Right, Tasia? Yeah, I think we can break; we have 15 minutes by the program, so we can have a coffee. See you again soon.