We have the last lecture of this whole school this week. Of course, next week there will be the conference, but for the school, this will be the last one. And this one will be given again by Filippo Vicentini. I wish you a lot of fun. Hi everyone, nice to see you back. So I will be helped here by Amidao, a PhD student in Paris, who has already done this notebook many times. He knows it by heart, so don't hesitate. I will be walking around, he will be walking around, so don't hesitate to ask as many questions as you want and to harass us about anything you don't understand. There will be many things that you don't understand, I guess, because it's not easy, but that's why we're here. Okay, so for this complicated spiral thing, you just go to this link, right? So on my GitHub lectures page there's a lot of stuff, a lot of lectures I've given over the years; the last one is the ICTP one, and you can click there. There are two notebooks; I think we will only have time to do the first one. But if you are very, very charged up, you can try the second one later on in the week too. I don't know, over the weekend, if it starts raining, you can try it. Anyway, today we'll be doing the first one. At the top, there's a beautiful button that says Open in Colab. So if you don't know how to set things up, just open it in Colab; you only need a Google account, and then everything should just run on the cloud, which is beautiful. If you know how to deal with notebooks, you can just download the notebook and run it by yourself, okay? Good. So the idea of the notebook is essentially to build a variational Monte Carlo code: a small code that minimizes the energy to find the ground state, with some simple variational ansätze, not even neural networks at first, and at the end one neural network. That's the first notebook. The second notebook uses stochastic reconfiguration, or natural gradient descent, but as I said, I will not have the time to go through it.
You will also be able to find the solutions to those notebooks online, but don't look at them, guys, not today. Yes. And the idea is that the notebook will let you build a code from scratch, or let's say almost from scratch. There's some very technical stuff that nobody wants to implement themselves, like how to build the sort of sparse matrix where you can encode hundreds of spins, et cetera. This is boring stuff, so you will use one package that we developed over the years, NetKet, that does that for you. But everything else you will be doing by yourself. The notebook will also show you how to use NetKet to do it much more easily, but what I want you to learn is how to do it by yourself, because that way you really see it hands-on. So just to quickly show it: what you can do is write my name, right? GitHub, Filippo Vicentini, lectures. You will find it; it's one of the first things showing up. You just go to the repository, that's the one, right: 24-04-ICTP. You click there. You click on VMC from scratch. And then again, if you know how, you can download it from here, if you want to work on it on your own computer and you know how to deal with Python environments and everything else. For those who are not super familiar, or if you don't want to lose 15 minutes setting up a Python environment, you just click Open in Colab, and hopefully this works. Yes, beautiful. Okay, and to run this in Colab, essentially, what you need to do is install a bunch of dependencies, which means running this pip install netket. That will install everything you need. So what I will do, essentially, is erase the comment character at the beginning, okay, and run it. And of course, it will tell you that someone might hijack your Google account data. Of course, that's exactly what this netket package does: it can steal all the papers you're working on. That's why we give these lectures.
And this might take a minute or so. If you're working on your own computer, you can just install NetKet in your own environment, whatever. And NetKet will install JAX. So one ugly thing about this lecture is that we will be using JAX, which I know some people don't like. I know some people say PyTorch is easier. Indeed, but that's what I use, so you're obliged to follow me, okay? I will just give you a bunch of intro things and then I will let you go through it at your own pace. Yeah, just in case: if you're running it on your own computer, make sure that you have a recent Python version and a recent NetKet version installed, because if you're running on Colab everything will just work. In particular, Python 3.9 at least, and for NetKet you actually need the latest one, 3.11.4, but whatever, okay? And since I prefer to work on my laptop, I will be giving the lecture from my own computer, okay? Okay, and you can see that I have a recent enough version of NetKet, et cetera, et cetera. Okay, so first thing: the notebook imports a lot of packages that we will use throughout, like NumPy. If you're not familiar with JAX, there is this thing, essentially, for reasons that I will not get into: to use JAX, you need to replace all the calls where normally you would use NumPy with jax.numpy, okay? So essentially, alongside import numpy as np, we will also be doing import jax.numpy as jnp. So jnp, jax.numpy, is a complete replacement of NumPy that supports automatic differentiation and a bunch of other fancy things. Simply use it in place of NumPy for everything: wherever you would use NumPy, just use jnp instead, okay? And before I leave you, I just want to showcase a few things.
So NetKet is both a high-level package that allows you to do variational Monte Carlo calculations and dynamics simulations with neural quantum states, where the algorithms are packaged up for you and you only need to define the neural network architecture you want to use and all the hyperparameters. But NetKet is also a low-level package that allows you to construct the Hamiltonian and the operators and to use those to implement your own algorithms, plus a bunch of utilities that we use every day when we work on our research. So very quickly, I just want to show a bunch of things you can do. For example, in this case, the idea of the notebook is essentially to minimize the energy of this Hamiltonian. That's the transverse-field Ising model. This is a stoquastic Hamiltonian, so it's very simple. You could also try something harder later on. So essentially we have a nearest-neighbor coupling along the z direction with intensity J, and some transverse field, minus h, along the x direction, okay? Yes? Yes? Yes, so as I said before, maybe I will write it here again. You have to go to github.com/PhilipVinc/lectures, okay? One: go to 24-04-ICTP; two: notebook number one; three: Open in Colab. Okay? So the transverse-field Ising model can be defined on any lattice, right? For now I will be working on a very simple four-by-four lattice, a square lattice. In that case, since we're very lazy, someone implemented a while ago a class that handles all the lattices you can think of: cubes, hypercubes, pyrochlore, anything you can think of. If you want to thank someone, it's Christopher Roth and Attila Szabó, at Max Planck right now. So essentially you can do this very simple netket.graph.Hypercube, and this allows you to define the length, the side of the hypercube, right? Two dimensions, so that's a square with periodic boundary conditions.
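To make the lattice bookkeeping concrete, here is a minimal plain-Python sketch (not NetKet code; the helper name is made up) of how the edge list of a periodic square lattice can be generated by hand:

```python
def square_lattice_edges(L):
    """Edges of an L x L square lattice with periodic boundaries.

    Site (x, y) is labelled x * L + y; every site contributes two
    edges (to its right and bottom neighbours), so there are
    2 * L**2 edges in total.
    """
    edges = []
    for x in range(L):
        for y in range(L):
            i = x * L + y
            edges.append((i, x * L + (y + 1) % L))    # horizontal bond
            edges.append((i, ((x + 1) % L) * L + y))  # vertical bond
    return edges

edges = square_lattice_edges(4)
print(len(edges))  # 32 edges for the 4x4 periodic lattice
```

Every site ends up with four neighbours, as expected for a periodic square lattice; this is the same information `netket.graph.Hypercube` hands you through its edge list.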
And the idea is that NetKet is divided into blobs, let's say, different groups of functionality. So you have netket.graph, which unsurprisingly contains utilities to work with graphs. And you have netket.operator, which lets you work with operators, like Hamiltonians, essentially. netket.optimizer for optimizers, netket.nn, which implements some machine-learning, neural-network stuff, netket.sampler, which lets you work with samplers, and a bunch of other things. So, about the graph: you can check how many nodes are in this graph, so essentially how many sites, and if you run this cell, you get that there are 16 nodes. You can see what the nodes are, right? They're just labelled from zero to 15: zero, one, two, three, four, five. If you used Julia before with Miles, well, again, I'm sorry, you have to bear with zero-based indexing. You can check what the edges are, okay? So the edges where we have an interaction. And you can see them, right? You have, I don't know, an edge between sites three and seven, 12 and 13, et cetera, et cetera. And you can always use these to build your Hamiltonian later on. And you can also draw it, because a picture is worth a thousand words, right? And you can see that this is periodic, because you have an edge that connects sites 12 and zero, for example, okay? Good. So, now, in that case, we use this Hilbert-space thing. Essentially we want to define what the computational basis is. If you remember, in my lecture the other day I said we are working with a certain Hilbert space, right? And the basis of the Hilbert space for n spin one-halves is essentially the set of basis elements up, up, up, up, down, et cetera, down to all down, right? So this is what we call the Hilbert space, and it simply defines our computational basis; it's for historical reasons that we call it the Hilbert space.
And this is an object that you can use to get all those configurations very quickly without having to write the code yourself. So it's, again, a bit easier: you don't have to think about all those indexing things, and it allows you to go from an integer that spans, let's say, from zero to two to the n, to one of those configurations, which are stored essentially as binary numbers. So in this case, I build the basis for spin one-half with N sites, okay? If I look at this, I can do, for example, hi.all_states(), and it will return an array that has shape 65,000-something times 16. 65,000-something because there are two to the 16 states, and 16 is the second dimension because I have 16 sites. So essentially I have 16 binary values, right? Up or down, one or zero; in this case, one or minus one. Is that clear? Yes? Yes. I mainly want to know if it's not clear, actually. Okay. And so, essentially, if I want to evaluate the neural network, my variational function, on one of those configurations, this is what we will be doing later. If you ever want to play with this, you can also convert those configurations, for example, to an integer number. So you could do, for example, hilbert.states_to_numbers, and this will give you an integer corresponding to each of those configurations, a number from zero to 65,000-something, okay? And you can also do the opposite. Of course, you cannot do this for very, very large Hilbert spaces, but for those that fit inside some 32- or 64-bit integer, you can do it, and it's quite useful. And you can also check, again, the number of states that are in this Hilbert space. Okay. And now, the last thing: operators. How do we build operators in NetKet? We could use sparse matrices. We do kind of use sparse matrices internally, but in a special format that allows us to implement efficiently the operations that we care about.
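The integer-to-configuration bookkeeping just described can be sketched in plain NumPy (a stand-in for `hilbert.all_states()` and `hilbert.states_to_numbers`, with made-up names and an arbitrary bit-ordering convention):

```python
import numpy as np

def all_states(n):
    """All 2**n spin-1/2 configurations as +/-1 arrays, one per row.

    Row k holds the binary digits of k mapped to {-1, +1}, mimicking
    what the Hilbert-space object gives you, up to ordering conventions.
    """
    bits = (np.arange(2**n)[:, None] >> np.arange(n)[::-1]) & 1
    return 2 * bits - 1

def states_to_numbers(states):
    """Inverse map: +/-1 configurations back to their integer labels."""
    bits = (states + 1) // 2
    n = states.shape[-1]
    return bits @ (1 << np.arange(n)[::-1])

states = all_states(4)
print(states.shape)  # (16, 4)
```

The round trip `states_to_numbers(all_states(n))` returns 0, 1, 2, … in order, which is exactly the kind of consistency you rely on when you index a dense wave-function vector by configuration.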
And if you remember, the operation we care about, I mentioned yesterday, right? So, inside our algorithms, you will be implementing a calculation of the local energy, which, if you remember, was essentially ⟨x|H|ψ_θ⟩ / ⟨x|ψ_θ⟩, right? And we said that if we insert an identity here, we get Σ_y ⟨x|H|y⟩ ⟨y|ψ_θ⟩ / ⟨x|ψ_θ⟩. This is what we will be computing later on. And we said the sum over y is a sum over all those configurations, but actually I can restrict it to a sum over the y's where ⟨x|H|y⟩ is not zero, and then I just compute this ratio, okay? So essentially what we care about is a way to query those connected entries, as I call them, right? NetKet operators allow you to very easily pass in a bunch of configurations x and get all the y's for which ⟨x|H|y⟩ is not zero. Is this clear? Yes. So in a sense, the x's are like rows: you fix the rows, and you ask what the columns are that have nonzero entries in those rows, and what the values are. It's essentially equivalent to the CSR sparse format, if you are interested in that, but where instead of using integers to label the rows and the columns, we use those bit strings, those configurations, so up, up, up, down, et cetera. Because those are the objects on which we will be evaluating the neural network. We also never use integers because, for more than 64 spins, it would be complicated to work with, I don't know, 128- or 256-bit integers. I mean, if you used Julia, it would be very easy, but unfortunately we don't. And so we use that, okay? Is that clear? Questions? Good. So the way you can build operators in our syntax is simply to say netket.operator.spin, because we are working with spin operators, with spin Hamiltonians right now, dot sigmax. This is the σ_x operator.
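To make the connected-entries idea concrete, here is a hedged NumPy sketch (names made up, not NetKet's implementation) of the local energy for the transverse-field Ising chain. For this Hamiltonian the only y's connected to a configuration x are x itself (the diagonal ZZ part) and the n single spin flips of x, each with matrix element −h:

```python
import numpy as np

def local_energy(logpsi, x, J, h):
    """E_loc(x) = <x|H|psi>/<x|psi> for a periodic TFIM chain,
    summing only over the configurations y connected to x."""
    n = x.shape[0]
    edges = [(i, (i + 1) % n) for i in range(n)]
    e = J * sum(x[i] * x[j] for i, j in edges)   # diagonal ZZ term
    for i in range(n):                           # off-diagonal X terms
        y = x.copy()
        y[i] = -y[i]                             # flip one spin
        e += -h * np.exp(logpsi(y) - logpsi(x))  # -h * psi(y)/psi(x)
    return e
```

Since only n + 1 configurations enter, the cost per sample is linear in the system size instead of exponential, which is the whole point of the connected-entries query.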
You have to specify the Hilbert space, basically the computational basis, and then you say on which site you want it to act. So this line here builds a σ_x operator acting on site one. This one on site two. This one is σ_z acting on site two, et cetera, okay? Good. And you can look at those objects: you can print them, and they tell you, you know, dimension 16, which I guess means the Hilbert-space dimension; acting-on means on which sites it's acting, right? This one is acting on site one. The constant is essentially a diagonal shift, it doesn't really matter, and the dtype is simply what data type it uses to store things internally. And those objects, you can think of them as behaving exactly like any other operator you find around, right? I'm a real fan of code that is easy to use, so those objects you can literally sum together, multiply together, and they will behave like the matrices you expect them to be. So now I will let you start playing with this. The first thing you should do is simply write a very small piece of code to build the Hamiltonian, which is the transverse-field Ising Hamiltonian for a two-dimensional system, using all those things. In a sense, what you need to do is look at the graph, or you can generate all the site indices yourself, right? You loop over every node, every site index, and add the terms to the Hamiltonian, and then you also need to do the same for the edges. And once you have done that, the notebook will walk you through exact diagonalization to compute the ground-state energy, and then a lot of other things. And as you go on, I will discuss some other things later, okay? So do not hesitate to ask me things; we will be walking around. Please enjoy, or suffer.
Just one thing: I had forgotten to add at the top of the notebook an import jax.numpy as jnp, which I have now fixed. In case some errors pop up, that's why. And if you're interested in the solution to this first point: well, this first row here creates an empty operator, so it's zero, right? And then we simply want to write the first term in the Hamiltonian, the σ_x term. So I can do for site in g.nodes(), or even for i in range(16), whatever. And we do hamiltonian equals hamiltonian plus netket.operator.spin.sigmax of the Hilbert space and the site. With a minus, actually, because it has a minus sign. And likewise, to write the nearest-neighbor interaction term, we simply loop over the edges of the graph. Yeah, okay, Copilot knows me very well. You simply take the product of the two operators, right? σ_z on site i and σ_z on site j. And if I now look at this Hamiltonian — now I think I broke everything — you can see it says acting on 48, meaning I have 48 terms in my Hamiltonian. And this should correspond to 16 local terms plus 32 nearest-neighbor terms: 48 minus 16 is 32, which is the number of edges of the four-by-four periodic lattice. Okay? Yeah, for this it doesn't, for historical reasons, but yes, it's weird. So, once you have done that, there is this little thing here that checks whether your Hamiltonian is correct. The way it checks, like a tester that checks whether what you did is correct, is simply that it constructs one of our pre-built Hamiltonians, the Ising Hamiltonian, converts it to a sparse format, converts yours to a sparse format as well, and checks that the difference is small, okay? So then the example code here does something slightly weird that I will not get into the details of: it converts your Hamiltonian to what we call the JAX operator format, an object that JAX knows how to operate on.
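The same loop over sites and edges can be mimicked with dense Kronecker-product matrices in plain NumPy. This is only a sketch for a handful of spins (nothing like NetKet's internal representation, and the helper names are invented), but it makes the term-by-term construction explicit:

```python
import numpy as np

sx = np.array([[0., 1.], [1., 0.]])
sz = np.array([[1., 0.], [0., -1.]])

def single_site(op, i, n):
    """Embed a 2x2 operator acting on site i into the 2**n space."""
    m = np.eye(1)
    for k in range(n):
        m = np.kron(m, op if k == i else np.eye(2))
    return m

def ising_dense(n, edges, J=1.0, h=1.0):
    """H = J * sum_<ij> sz_i sz_j - h * sum_i sx_i, as a dense matrix."""
    H = np.zeros((2**n, 2**n))
    for i in range(n):                 # local transverse-field terms
        H -= h * single_site(sx, i, n)
    for i, j in edges:                 # nearest-neighbour ZZ terms
        H += J * single_site(sz, i, n) @ single_site(sz, j, n)
    return H
```

As a sanity check, for h = 0 on a three-site ring the model is purely classical and the lowest eigenvalue is the frustrated minimum of the edge sum, which is −1.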
So the first one is a Hamiltonian that works, you can use it to do a lot of things, but it's a bit weird; JAX doesn't love it. You can use it with JAX, but you have to be very careful. Instead, those JAX operator things are something a bit new that we have, and they allow you to use the operator in a much simpler fashion. So we simply do this, and we will be working with a JAX Hamiltonian. Yes, one day we will write a paper about it and the code will pop up, but right now it's not supported. Yeah, if you do it yourself, yes, like summing them and keeping track of what you can differentiate with respect to: when we sum them, we do some fancy stuff to compress them, to share the sparsity pattern, essentially, and we match the sparsity patterns of different operators; doing this in a way that is still differentiable is a bit tricky. However, if you simply have two operators and multiply them by a prefactor, that you can do. It would be nice to do this automatically; it's possible, but it would require some effort, and we will do it one day when we want to. Okay, and then it was asking you: how do you compute the ground state, right? The ground-state energy, if you convert the Hamiltonian to some sparse format. Well, you can simply use, okay, Copilot, but you can simply use this netket.exact.lanczos_ed, okay? You simply pass in the Hamiltonian, tell it you just want the ground state, so the first eigenvalue, and whether to compute the eigenvectors. Yeah, okay, Copilot is not so smart, actually. And so what this will do is simply give you the ground-state energy, right? And the ground-state vector. And, now I don't remember the syntax exactly, but you could also ask it to compute more than one eigenvector. So for example, in this case, I will have three eigenvalues.
So the ground-state energy, minus 34, the first excited state, and then another one, okay? But in general, you just care about one. Another thing you can do is always convert it to a sparse format and use SciPy as well. But in this little test here, I was checking that you actually wrote the exact-diagonalization part correctly. Okay, so the following part of the notebook proposes that you write some very simple variational ansatz, in particular a variational ansatz that encodes a product state, a mean-field state, that is positive, okay? So what it does is encode this product of local wave functions. Now, if we assume that the ground state is real, because this Hamiltonian is stoquastic, this means that in principle, for every site i, I have to know ψ_i(up) and ψ_i(down). And those are two real numbers. But if I enforce the normalization constraint, which I'm not obliged to, but if I decide to do it, well, then I actually only have one number to know per site. So my ansatz will have n real values, okay? And again, in NetKet, when we work with those variational ansätze: there's maybe one person, one of the organizers, who likes to encode the wave function itself, but most of the community encodes the logarithm of the wave function in the neural network. And so what you need to encode in your model is the logarithm. So instead of having this product, it becomes a sum, okay? And what this notebook shows you here is a lot of code that you can use to define neural networks in this language, and then how to use it. The way neural-network definitions go is taken from this package called Flax, which I tried to document a lot. But essentially, it's a lot of boilerplate, and the idea is that you need to define a class, call it however you want, and you need to define a method, the call method, that evaluates the neural-network amplitude given a set of those bit strings.
And then you can define the parameters. Here we exploit the translational symmetry of the Hamiltonian, and so we take only one parameter, lambda, because we assume it's the same for every site. And we take this log-sigmoid activation function, which essentially encodes the logarithm of this. So it's exactly the value of the log wave function at one site, okay, for this product state, and then we simply sum over the last axis. Why? Because, in a sense, imagine that my inputs x are, let's say: up, up, down, up; up, up, up, up; down, down, down, up. So this is a batch of samples, right? So this is a matrix of real numbers. The rows of the matrix run over the samples, and the columns of this matrix run over the sites that I have, so L squared of them. Do you agree with that? I hope so. And now when I evaluate the wave function, what happens? I evaluate this log-sigmoid function on every one of those numbers, and so I get something like φ₁(x₁,₁) — where x₁,₁ means this first spin here, the first spin of the first sample — then φ₂(x₂,₁), and so on up to φ_{L²}(x_{L²,1}), right? Because essentially I'm evaluating the local wave function of the product-state ansatz on every site for the first sample, and then I do that for the second sample, for the third, et cetera, et cetera. Is that clear? So now I get a matrix, and in a sense, if you think in terms of tensor networks, those would be the local coefficients on every site: kind of like the local tensor contracted with the configuration on every site, for the first sample, then for the second sample, et cetera, okay? And now, normally, I want to take the product, right?
However, since I want the log wave function — and actually, I'm sorry, I was wrong here: I don't compute φ, I actually compute log φ, log φ, log φ, et cetera — since I'm operating in this kind of log space, I simply sum them. So when I take the exponential, it's like taking the product of all those φᵢ's. Is that clear? And so I take the sum along each row, and I multiply by one half, which is simply because I take the square root of this value, because it's the wave function, and that's my ansatz. And then, the way we use those neural networks: first we have to instantiate this class, okay, this is just boilerplate, then we can initialize them. So we need to feed a random key, which is essentially a random-number generator, and I have to feed a sample configuration, so here I generate a random set of numbers of this size. And if we look at the parameters, they come in this PyTree format, so this dictionary of dictionaries of stuff. And if you look, you have params, meaning the parameters of the network, and lambda is this single number, okay? It's a very stupid neural quantum state, but a neural quantum state nevertheless. And here it shows you how you can operate on those trees. Essentially, you can define functions that operate element-wise on every element in the tree. So for example, if you want to sum two trees — because at some point we will want to sum two sets of parameters, right? — you can simply define a function and use this tree_map. It essentially applies a function to every set of arrays inside a set of nested trees, nested dictionaries. And so you can add two trees together by doing tree_map of add over dictionary one and dictionary two, which will sum the corresponding elements from the two trees, okay? So this is a technicality, but it's actually very useful, in a sense.
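The tree-map idea is simple enough to sketch in a few lines of pure Python. This is only a toy stand-in for `jax.tree.map` (which also handles lists, tuples, and registered custom nodes), restricted to nested dictionaries:

```python
def tree_map(f, *trees):
    """Minimal sketch of a jax.tree.map-style function for nested dicts:
    recurse through matching structures and apply f to the leaves."""
    if all(isinstance(t, dict) for t in trees):
        return {k: tree_map(f, *(t[k] for t in trees)) for k in trees[0]}
    return f(*trees)

a = {"params": {"lambda": 1.0}}
b = {"params": {"lambda": 0.5}}
print(tree_map(lambda x, y: x + y, a, b))  # {'params': {'lambda': 1.5}}
```

The point is that the function only ever sees leaves, so the same one-liner works no matter how deeply the parameters are nested.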
Every time you want to add two sets of parameters — since we store parameters not in one contiguous array, but in a dictionary of dictionaries of dictionaries, where every inner dictionary holds the parameters of one layer or another — instead of doing A plus B, you have to do jax.tree.map of add over A and B, okay? Just boring, but that's life, in a sense, okay? And the reason we don't store parameters contiguously is that, while for a single function it would be trivial to do so, if you have a neural network with a lot of layers, shared parameters, et cetera, it would be very messy to write the code for that; technically you could, but it's much easier to just have dictionaries, okay? And now, if you remember, we initialized our model, so we have the parameters, and then to evaluate the network, we simply call model.apply, which uses the code that we implemented before, and we pass the parameters and the inputs, okay? So if now I generate four random states from the Hilbert space, right, my inputs will be a matrix of shape four times 16, because it's four configurations times 16 sites. Now if I evaluate the wave function by calling model.apply, I get four amplitudes: log ψ of one, log ψ of two, log ψ of three, log ψ of four. Is this clear? Now, what this thing asks you to do is to try to implement yourself a small piece of code that converts this representation, okay, where you have your neural network that you can query, down to the usual array representation that you use all the time in, I don't know, QuTiP or your favorite code. Usually, right, you store the wave function as a vector. So for those small Hilbert spaces, for those small problems, we can still convert from one to the other.
And so what you can do is write a function, right, that takes the model and the parameters and simply returns the model evaluated on every possible configuration. So what you can do is, again — Copilot works for me — essentially, I take the exponential of model.apply of the parameters over all the configurations, which I take from the Hilbert space, okay? So essentially what this does is evaluate my network on all two to the power 16 configurations and take the exponential, because we encode the log wave function, okay? And if you run this little code below, it tests that this actually works, okay? And now, in case you want to know why we use JAX: essentially, if you time this code that converts your model to an array, it takes, for example, seven milliseconds on my computer, okay? You can also jit it, which essentially means you ask JAX to convert it to some C-like code and compile it. You need to tell it what the static arguments are, essentially all the arguments that are not arrays, the things JAX doesn't know how to deal with. So in this case, the parameters it knows how to deal with, but not the model definition itself. And so, as you see, I do jax.jit of my function, and then you can use it as before: you simply pass the model and the parameters. And now if we time it, you will see that this is somewhat faster: instead of seven milliseconds, it takes only four. And while for this example the gain is not very large, if you have larger networks and more complicated things, some workloads that are more reasonable, the speed gain can be much higher. Okay, so now what this asks you to do is to write a little code that, given the Hamiltonian, computes the energy itself. So if you remember, you now have a function that converts the wave function to an array, this to_array. So now you can take the Hamiltonian, convert it to a sparse matrix, and simply take the product to do this.
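Here is a NumPy stand-in for this to_array step (names made up; the real version jits `model.apply` over `hilbert.all_states()`), using the single-parameter mean-field model described above. Since sigmoid(z) + sigmoid(−z) = 1, the resulting vector is automatically normalized, which makes a nice self-check:

```python
import itertools
import numpy as np

def log_sigmoid(z):
    # numerically stable log(1 / (1 + exp(-z)))
    return -np.logaddexp(0.0, -z)

def mean_field_logpsi(lam, x):
    """Single-parameter product state: 0.5 * sum_i log_sigmoid(lam * x_i),
    summed over the last axis (the sites), one value per sample row."""
    return 0.5 * np.sum(log_sigmoid(lam * x), axis=-1)

def to_array(lam, n):
    """Evaluate the model on all 2**n configurations and exponentiate."""
    configs = np.array(list(itertools.product([1, -1], repeat=n)), dtype=float)
    return np.exp(mean_field_logpsi(lam, configs))

psi = to_array(0.8, 4)
print(psi.shape, np.sum(psi**2))  # (16,) and a norm of exactly 1
```

The one-half in the exponent is the square root from the text: the per-site probability is sigmoid(λxᵢ), and the amplitude is its square root.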
Okay? It's just using the simple stuff, right? You just take your wave function, you convert it to an array as we said before, and then you can simply multiply by the Hamiltonian as usual. You take the dot product, blah, blah, and this works, right? And you can also see, here it was playing a bit with computing the energy using the standard Hamiltonian converted to sparse; you also have the JAX sparse Hamiltonian, if you care about these kinds of things, and you can see that you get different performance, if you jit, if you don't jit, these kinds of things. Well, it's not that important. So now, what is interesting is computing the gradient of this stuff. So how do we compute gradients in JAX? Well, there is this lovely function called jax.value_and_grad, which essentially computes the value of the function and its gradient. Why value and gradient? Because if you want to compute the gradient, it's doing backward propagation, and to do backward propagation, you have to do the forward propagation anyway. So anytime you ask for the gradient, you have to compute the value of the function anyway. So that's why you get both. Then if you want just the gradient, you just discard the value, okay? So the way it works is you pass it the function you want to differentiate, this compute_energy that I define here, which takes three arguments: the model, which is just some class that defines how to evaluate, let's say, my functional form, my product state or whatever else; the parameters, which are our actual numbers; and the Hamiltonian. So if I take the gradient of this function, I could compute the gradient with respect to three things. The gradient with respect to the model — but, okay, this is mathematically not well defined, I'd have to compute the gradient of some class, so forget it. The gradient with respect to the parameters: this is well defined. And the gradient with respect to the Hamiltonian: that is also well defined.
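jax.value_and_grad does this by automatic differentiation; as a cross-check of what it computes, here is a hedged NumPy sketch (all names invented, only sensible for a few spins) that evaluates E(λ) by full summation and approximates dE/dλ with central finite differences:

```python
import itertools
import numpy as np

def dense_tfim(n, J=1.0, h=1.0):
    """Periodic TFIM chain as a dense matrix (small n only)."""
    sx = np.array([[0., 1.], [1., 0.]])
    sz = np.diag([1., -1.])
    def site(op, i):
        m = np.eye(1)
        for k in range(n):
            m = np.kron(m, op if k == i else np.eye(2))
        return m
    return (J * sum(site(sz, i) @ site(sz, (i + 1) % n) for i in range(n))
            - h * sum(site(sx, i) for i in range(n)))

def to_array(lam, n):
    """Mean-field psi as a dense vector; basis ordering matches the
    Kronecker construction above (first bit <-> first site, 1 = up)."""
    configs = np.array(list(itertools.product([1, -1], repeat=n)), dtype=float)
    logp = 0.5 * np.sum(-np.logaddexp(0.0, -lam * configs), axis=-1)
    return np.exp(logp)

def energy(lam, H, n):
    psi = to_array(lam, n)
    return psi @ H @ psi / (psi @ psi)

n = 3
H = dense_tfim(n, J=1.0, h=1.0)
eps = 1e-5
grad_fd = (energy(0.3 + eps, H, n) - energy(0.3 - eps, H, n)) / (2 * eps)
print(energy(0.3, H, n), grad_fd)
```

At λ = 0 the state is uniform, so the ZZ part averages to zero and the energy is just −h·n; by the λ → −λ symmetry of this model, the gradient vanishes there as well.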
So what I do here is define this compute_energy_and_grad in the same spirit, right? Again, it takes the model, the parameters, and the Hamiltonian, and very simply it takes the gradient of compute_energy, and I specify argnums=1. So I say I want the gradient with respect to argument number one — counting from zero — which is the parameters, if you remember. So what jax.value_and_grad does is take a function and give me a new function, okay? It doesn't really do the calculation; it takes my function and creates a new function that does what I want. This is also called a higher-order function, or a functor, by programming-language fans. And then, so this is my grad function, and then I can evaluate it. It takes the same arguments as before, so the model, the parameters, and the Hamiltonian, but now it gives me my result. The way this works, very loosely, is that JAX will execute your function with some fake arguments, understand what happens inside it, and then replace every function call with the respective function that computes the value and the gradient, recursively. And that's how it works: it builds a source-code representation of your function, transforms that representation into something that computes the value and the gradient, and then you can evaluate that. Now, the key point is that this machinery is quite slow, and so you want to compile it: all the logic that does the parsing, the conversion to source code, the source-to-source transformation that computes the gradient, and then the execution — if you jit-compile all of this, you eliminate all the overhead of doing those transformations. And so, in a sense, a good rule of thumb when using JAX is that anytime you want to compute gradients, you should always jit the gradient evaluation.
And that's why there is this first line here, which essentially compiles the function. Now if I execute it, you can see that I get both the value, the energy, and the gradient, right? And if I print the gradient, you can see it's minus 0.4, whatever. So now I can use it to optimize my parameters, right? That's what the next block does. I use tqdm just to print a progress bar, because I really like progress bars, but that's not very important. You see here, I first create the model, I initialize the parameters with some random data, and I create a logger, which is some fancy thing to store the whole time series of the energy at every step. And now I create a loop. What do I want to do at every iteration? Let's say I take 50 iterations, because this is a very simple model. I want to compute the energy and the gradient: energy, gradient equals compute_energy_and_grad. Thank you. Yes. I pass the JAX sparse Hamiltonian. And now I want to update my parameters, right? So I do parameters equals jax.tree.map of lambda p, g: p minus 0.01 times g. The lambda is a function applied to every pair of corresponding objects, a parameter and its gradient, so what really happens is this: I take my parameter and subtract 0.01 times the gradient. And then I log the energy. If I run this, you see I have a beautiful progress bar, and if I plot it, I get this curve, okay? The energy is going down. Hurrah, that's what I wanted. And then, of course, it converges to something, but this energy will not be very good. If we plot the difference with respect to the ground-state energy in log scale, you can see it's about 18, so we are very far from the ground-state energy. That's not surprising, right? Because I have one parameter.
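The loop described above can be sketched with a toy loss in place of the energy (the dictionary pytree and learning rate are illustrative assumptions; the update rule is exactly the tree-map pattern from the lecture):

```python
import jax
import jax.numpy as jnp

# Toy loss standing in for the energy, with its minimum at w = 3.
def loss(params):
    return jnp.sum((params["w"] - 3.0) ** 2)

loss_and_grad = jax.jit(jax.value_and_grad(loss))

params = {"w": jnp.zeros(4)}
for _ in range(50):
    value, grad = loss_and_grad(params)
    # plain gradient descent, applied leaf by leaf over the parameter pytree
    params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grad)
# after 50 steps, params["w"] is very close to the minimum at 3
```

Because tree_map works leaf by leaf, the same one-liner updates a single array, a dict of arrays, or a whole neural-network parameter tree.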
I mean, it's really surprising that this gives me a number at all. I have one parameter and I'm very far from the solution, so I need to build a better ansatz, right? That's what the Jastrow-ansatz section is about. The Jastrow ansatz was originally proposed to study fermionic systems, to encode the Coulomb repulsion between different electrons. But it turns out that it's deeply connected to the structure of the ground state of Ising models, and it's a very good guess for some quantum Ising models, like this one in particular. The ansatz essentially has this form: you build this J_ij matrix of parameters, number of sites by number of sites, so L squared by L squared in our case, and you contract it with sigma. The sigmas are your configuration, right? Plus one or minus one for every site. And, of course, our Jastrow ansatz encodes the log of this wave function, so it's simply this contraction here, okay? This structure can encode the ground state of any classical Ising model. For a quantum Ising model that's no longer true, and that's why it will not give us the exact answer, but it will give us a decent one. So, again, we need to build a class like this, okay? We define a Jastrow class, again with a call function, okay? I have this input x, the batch of samples that I was discussing before. And what I do now is create this J matrix. This is the syntax to create a set of parameters, which I call J, and it's n_sites by n_sites, okay? Then I do a bunch of boring stuff where I promote the types, to make sure that the type of J and of the input is the same: maybe one is single precision and the other double precision, maybe one is complex and the other real. This is just boilerplate, nothing deep, it's not very important. Then I symmetrize this J matrix because, of course, right?
If I have this structure, I want this matrix to be symmetric; if it's not symmetric, the ansatz doesn't make too much sense. And besides that, there is another very important thing: if I know my ground state is translationally invariant, I could also build this J matrix in a way that encodes translational invariance. Here I'm not doing it, because it's a bit annoying, but it is possible. Now, for a symmetric matrix I really only have n times (n minus one) divided by two free parameters, so I should declare n(n-1)/2 variational parameters and insert them into the matrix, right? I could do that, but I didn't, because I'm a bit lazy. If you want to know how to do it, ask Luciano here, who will be very happy to show you. What I do instead is simply create a matrix that is not symmetric and symmetrize it by adding its transpose. Okay, and that's valid. The point is that the variational optimization doesn't know how to preserve the symmetry of a matrix, so to be sure I always compute with a symmetric matrix, I over-parameterize and the symmetrization takes care of the redundant parameters. And then, to compute the result, what I want is literally this operation. If you're familiar with einsum, Einstein summation, as Copilot is, it's super simple: you take the input and the J matrix, and you sum over i and j, the two indices of the J matrix, contracted with the last dimension of x, the sites. And all those dots, the ellipses, tell einsum that those are the batch dimensions, the extra dimensions that enumerate the configurations, and that the result should just keep those same dimensions, okay? And then you return it.
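A minimal sketch of that contraction, stripped of the class and type-promotion boilerplate (the function name and the random test data are mine; the symmetrization and the einsum string follow the recipe above):

```python
import numpy as np

def jastrow_logpsi(J, x):
    # Over-parameterize and symmetrize instead of storing only the
    # n(n-1)/2 free entries: gradient descent would not preserve the
    # symmetry of J on its own.
    J_symm = J + J.T
    # "...i,ij,...j->...": contract the site index of x with both indices
    # of J; the ellipses carry any batch dimensions through unchanged.
    return np.einsum("...i,ij,...j->...", x, J_symm, x)

rng = np.random.default_rng(0)
J = rng.normal(size=(16, 16))
batch = rng.choice([-1.0, 1.0], size=(20, 16))  # 20 samples of 16 spins
log_amps = jastrow_logpsi(J, batch)
# log_amps has shape (20,): one log-amplitude per sample in the batch
```

The same einsum works unchanged for a single configuration of shape (16,) or a batch of chains of shape (20, 100, 16), which is exactly why the ellipsis notation is convenient here.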
And now if you run this block, it constructs the Jastrow and evaluates it, checking that your code works for one single sample, for a batch of samples, et cetera. But in the end, that's not very important, okay? So far, what we have done is compute the gradient using the full Hamiltonian, right? We converted our variational ansatz to an array, so this is not really exploiting the structure or the sampling that we discussed yesterday. So now what I would like to show you is how to play with Monte Carlo. Of course, there's not much time left, so I will go a bit quickly through this. Again, in NetKet you can take the samplers that are already built, like a Metropolis sampler, and you can specify a transition rule. You remember we discussed that when you build the Markov chain, you need some transition proposal: either you flip one spin, or you swap two spins, this kind of thing. Here I'm using the local rule, which flips a single spin. You could build these on your own, but this stuff has already been written, and it's quite flexible, so it's very easy to use. You can also specify how many Markov chains you want, because usually you don't want to propagate one chain at a time: you want to propagate many chains at the same time, which is computationally more efficient just because of how computers, or GPUs, are built; mathematically, of course, it's equivalent. And the syntax for doing that, again, involves a lot of boilerplate. You take the sampler and initialize its state, which is something that stores some internal variables. You reset it, essentially to clear the history and seed the initial random configuration of your Markov chain, and then you sample. Every time you call this function, you need to provide your model, your neural-network architecture, and the parameters that you have.
You also pass this sampler state, which keeps the last configuration of the Markov chain, and you can specify how long you want the chain to be. If we look at the output, you will see that it has shape 20 by 100 by 16, which means you have 16 sites, right? The last dimension holds the 16 physical spins of each bit string, and then you have 20 chains, each with 100 configurations, okay? So, if you want, I have 20 times 100 samples, and every sample contains the configuration of the 16 physical spins. Is it clear? And now, this dreaded monster: how do we compute the energy? If you recall, I said the local energies are the first step, so I need to implement the code that computes the local energy. And what I would like to show you here is this key point: hamiltonian.get_conn_padded. This get_conn_padded is the function that, if you feed it x, gives you all the y's such that the matrix element is not zero, and it also gives you the matrix elements, okay? So let's look at it. This sigma is my x in this case, right? It's this bit string, up, down, down, up, down, et cetera. And now I want to look at the connected elements I got, these etas, okay? What do you expect to find here? What are the connected elements of the Ising Hamiltonian? Anyone? Speak up. Okay, so in the Hamiltonian we have these two terms, right? The transverse field, a sum over i of sigma-x_i, plus the interactions, a sum over i, j of sigma-z_i sigma-z_j, correct? So if I have a configuration x_1, ..., x_N, what does the z-z term do? Does it change the configuration? No. Exactly, it's diagonal. So this term gives me, for all the i, j's, the same configuration back, okay? So I will find one y that is exactly identical to x, okay? That is actually the answer. And then, what does the other term do?
Yes, it flips a single spin. And I have L squared of those terms, so I should find L squared configurations, 16 in our case, each identical to x but with a single spin flipped, okay? So let's look at it, and if I didn't get something wrong, this is exactly what we should see. Let's look at the first one. We can simply do eta minus sigma and... oh no, sorry. The first one is actually sigma with the first spin flipped, okay? And the second one has the second spin flipped, you see? The third one likewise, et cetera. And the last one is the identical one, okay? Is this clear? So in total there should be L squared plus one of them. Let's see if I'm right. Yes, you see? I have L squared plus one configurations, so 17, and each of those configurations contains 16 physical spins, okay? And at the same time, I have the matrix elements, these h(sigma, eta), the H_xy. What do you think those are? What are the matrix elements corresponding to the local terms where I flipped a single spin? The transverse magnetic field, which we had set to minus one. And only the last one should be the sum of all the J terms... so why is it zero? Beautiful. I think I forgot to add the transverse field, but okay, let's go on, there's not much time. Anyway, we have all those terms, okay? Now, when you call get_conn_padded, you can feed it not just one configuration but many, and instead of getting back the connected entries for one configuration, you get all of them. So for example, if I feed it a whole batch, all the Markov chains with all their configurations, get_conn_padded simply adds a new dimension: for every one of my samples, I get a set of connected entries.
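To show the structure of those connected entries, here is a simplified stand-in for get_conn_padded, hand-written for a transverse-field Ising chain with nearest-neighbor couplings and periodic boundaries (the 1D geometry, sign conventions, and function name are my assumptions; NetKet's real operator handles arbitrary graphs and padding):

```python
import numpy as np

def ising_connected(sigma, h=1.0, J=1.0):
    # Connected configurations and matrix elements for
    # H = -h * sum_i sx_i - J * sum_i sz_i sz_{i+1} (periodic chain).
    n = sigma.shape[-1]
    # n flipped copies (one per sx term) plus the unchanged configuration
    etas = np.tile(sigma, (n + 1, 1))
    for i in range(n):
        etas[i, i] *= -1              # each sx_i flips exactly one spin
    mels = np.full(n + 1, -h)         # off-diagonal elements are all -h
    mels[-1] = -J * np.sum(sigma * np.roll(sigma, 1))  # diagonal zz term
    return etas, mels

sigma = np.ones(4)
etas, mels = ising_connected(sigma)
# 5 connected configurations: 4 single flips plus sigma itself (diagonal)
```

For n spins you always get n + 1 connected entries per configuration, which matches the L squared plus one count seen in the notebook.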
So you see, if my samples are stored in this batch of shape (20, 16), 20 configurations, each with 16 physical spins, then my connected entries will be 20 sets of connected entries, each set containing 17 configurations of 16 spins. Okay, is that clear? Good. The way we compute the local energy, then, is simply to define this function, which takes the model, the parameters, the Hamiltonian and sigma, my configurations. I evaluate the wave function for all my sigmas, the current configurations, and I evaluate my model for all the etas, all the connected configurations that I get by calling hamiltonian_jax.get_conn_padded. Then, yes, I want to make sure they have compatible shapes, because eta has one extra dimension, right? This extra axis of size 17 that you see here, which spans the connected entries. So I add one dimension on the sigma side. And then... no, this is wrong, Copilot. What I do is take my matrix elements and multiply them by the ratio built from log psi of eta minus log psi of sigma, the numerator minus the denominator, because I'm in log space, right? So the division becomes a difference, and I should then take the exponential and sum over the last axis, over all the connected entries. No, this is wrong, sorry: I take the exponential of the difference first, then multiply by the matrix elements and sum over the last axis, okay? So essentially this is just an implementation of the formula here, and if you were working on it on your own, this check would tell you that it's indeed correct, okay? And then you can also jit it. Again, you need to tell JAX that the model is a static argument it doesn't know how to trace. And if you run it, it will do the work and give you those numbers, okay?
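The local-energy formula just described can be sketched like this (the function signature follows the lecture; the trivial constant wave function used to check it is my own addition):

```python
import numpy as np

def local_energy(logpsi, params, etas, mels, sigma):
    # E_loc(sigma) = sum_eta H[sigma, eta] * psi(eta) / psi(sigma);
    # in log space the ratio becomes exp(logpsi(eta) - logpsi(sigma)).
    log_sigma = logpsi(params, sigma)   # shape (batch,)
    log_eta = logpsi(params, etas)      # shape (batch, n_conn)
    ratios = np.exp(log_eta - log_sigma[..., None])  # add the extra axis
    return np.sum(mels * ratios, axis=-1)            # sum over connections

# Sanity check with a constant wave function: every ratio is 1,
# so the local energy reduces to the sum of the matrix elements.
flat = lambda params, x: np.zeros(x.shape[:-1])
sigma = np.ones((2, 4))            # 2 samples, 4 spins each
etas = np.ones((2, 5, 4))          # 5 connected configurations per sample
mels = np.arange(10.0).reshape(2, 5)
e_loc = local_energy(flat, None, etas, mels, sigma)
# e_loc is [10., 35.]: the row sums of the matrix elements
```

Note the `[..., None]`: that is exactly the extra dimension added to the sigma side so the log-amplitudes broadcast against the connected-entries axis.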
So now, the last thing before I say goodbye and let you go enjoy your nice weekend, or maybe you'll keep working on this all weekend instead of taking a dip in the sea, I don't know. So, how do you estimate the energy using variational Monte Carlo? Well, I told you before, right? You call the sampler, and it gives you samples distributed according to the probability distribution. We have a function that computes the local energy for every one of those samples. So to estimate the average energy, we simply take the mean. That's it, done, super easy. To estimate the variance, take the variance of the local energies, easy. And to estimate the statistical error, I take the square root of the variance divided by the number of samples. To get the number of samples, I can do np.prod of the shape of my samples divided by the last dimension, the number of spins: this looks at the shape of your samples, divides out the spins, and gives me an integer. And that gives me the error. Then you can package everything into one of those statistics objects, but that's not so important. You see, if I run this, I get my mean, the error bar, the variance, everything else you want, okay? And now, if I want to estimate it to better accuracy, what do I do? I simply take a lot more samples, a chain length of 5,000, say. I evaluate my energy exactly, and then I estimate it with Monte Carlo. You see the exact value is minus 16, and here, up to some numerical noise, my estimate is minus 16.007 plus or minus 0.072. And if I add a zero here, so ten times more samples, I expect my error to go down accordingly: see, from 0.07 it went down to 0.02.
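The three estimators above fit in a few lines; here is a sketch with synthetic local energies standing in for real sampler output (the Gaussian fake data is my assumption, and the error bar is the naive one that ignores autocorrelation along each chain, which a real statistics object would correct for):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are local energies from 20 Markov chains of 100 samples each.
e_loc = rng.normal(loc=-16.0, scale=0.5, size=(20, 100))

n_samples = e_loc.size          # total number of samples: 20 * 100 = 2000
mean = e_loc.mean()             # the energy estimate
variance = e_loc.var()          # variance of the local energy
# naive statistical error bar: sqrt(variance / n_samples)
error = np.sqrt(variance / n_samples)
```

With 2000 samples of standard deviation 0.5, the error bar comes out around 0.01, and multiplying the sample count by 100 would shrink it by another factor of 10, which is the square-root scaling discussed next.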
Of course, the error doesn't go down that fast, right? It goes down as the square root of the number of samples, which is a bit annoying: to gain one order of magnitude in accuracy, I need two orders of magnitude more samples, okay? And now, super quickly, how do you compute the gradient of the energy? If you remember, yesterday I told you that we rewrite the estimator of the energy gradient as this product: the Jacobian of my log-wave-function times this vector of local energies minus their average. That's the last thing I will tell you. So we define this function, logpsi_sigma_fun, which is basically my model with the parameters passed in and the vector of samples closed over, and you can see that its output shape is 2000: the number of samples I have. Then, if I want the Jacobian, I can simply use jax.jacrev, which means Jacobian computed with reverse-mode automatic differentiation. I pass it this function of a single argument, the parameters, and it gives me the Jacobian. The parameters here are this J, which is a 16 by 16 matrix. So I have a 16 by 16 matrix, and if I compute the Jacobian of this function with respect to the parameters, for every sample I get a 16 by 16 matrix. So the Jacobian of this function with respect to the J matrix in my Jastrow is this tensor whose first axis is 2000, the samples, and whose other axes correspond to the parameters I'm differentiating with respect to, okay? Is it clear?
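A small sketch of that Jacobian computation, using a Jastrow-style log-amplitude on a reduced size so the shapes are easy to read (the 4-site, 10-sample sizes are illustrative, not the notebook's 16 and 2000):

```python
import jax
import jax.numpy as jnp

# Jastrow-style log-amplitude: J is the (n, n) parameter matrix.
def logpsi(J, x):
    return jnp.einsum("...i,ij,...j->...", x, J + J.T, x)

J = jnp.zeros((4, 4))
samples = jnp.ones((10, 4))   # 10 samples of 4 spins

# Reverse-mode Jacobian with respect to the parameters only: close over
# the samples so the differentiated function has a single argument.
jacobian = jax.jacrev(lambda p: logpsi(p, samples))(J)
# shape (10, 4, 4): the first axis runs over samples, and the remaining
# axes mirror the parameter shape, exactly as described in the lecture
```

Since the quadratic form gives d(logpsi)/dJ_ab = x_a x_b + x_b x_a, with all-ones samples every Jacobian entry is 2, which makes the result easy to verify by hand.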
In a sense, the Jacobian is normally defined mathematically as a 2D matrix, where the rows correspond to the outputs, here the samples, and the columns correspond to the parameters. But we can generalize that and say: if I'm computing the Jacobian with respect to something that has many dimensions, I simply append all those dimensions. So only the first axis corresponds to the samples, and everything else has the same shape as the parameters I'm differentiating with respect to, okay? And so you can play, in the rest of the notebook, which I leave up to you, with computing those gradients, et cetera, and then converging, okay? If you run into issues, because it's not easy, you can go to the NetKet documentation website: the last tutorial there, VMC from Scratch, is exactly the notebook I gave you, up to some tiny differences, and there you will find everything I showed you, including the solutions, okay? So if you want to play with this, you can take this code, we arrived here at sampling the energy gradient, and you can literally look at the solutions and play with it yourself, okay? With that, thank you for listening to me for so long. If you have questions about this, don't hesitate to ask me, send me an email or whatever. On the NetKet website there's also a small button at the top right that lets you join a Slack channel where people chat about it, so if you have questions later, there are people who might give you answers. And yeah, I think that's it. Those two notebooks, you're free to play with them; they let you build a full variational Monte Carlo from scratch. Have a nice weekend. Thanks, Filippo. Final questions?
If not, then I would like to take the chance to, of course, thank Filippo again, but also all the other lecturers of the school, like Florian, who I see is still hanging around, and Miles, who was also hanging around, for this great school. Thanks a lot. Thank you.