Today we are very happy to start the third lecture of the conference, and we have Filippo Vicentini from École Polytechnique, who will tell us about the simulation of quantum many-body systems with neural quantum states. Please go ahead.

Thank you Estelle, thanks for the introduction. Good morning everybody; I will start a bit slow so you can slowly wake up from whatever you did last night in the centre of Trieste, I hope you tried spritz and all the beauties the city has to offer. Before I start to wake you up, I would like to tell you very briefly how I came to be here, because it's quite funny. I did not start out doing machine learning: a few years ago, during my PhD, I was actually studying quantum optics and open quantum systems, and only towards the end did I get interested in machine learning, and in particular in how we could use neural networks to perform simulations that we could not do otherwise. Fun fact: we never managed to do those simulations, but I still got my PhD. Then I wanted to go to the Flatiron Institute in New York, but this funny man said "no, you can't" and refused my visa; so if any of you ever has problems with visas, it happens to everyone, it happened to me twice. Eventually I went to Lausanne to work with Giuseppe Carleo, which was very nice; the city is not the same, but scientifically it was a very nice stay. In that process I started working a lot on neural quantum states and machine learning for quantum physics, and over the years we tried to build a bridge between the language used by the machine learning community and the language used by the quantum physics community, in particular the computational quantum physics community. I also became a strong advocate for open code and for open-sourcing everything we do, because I strongly think that if you publish your results and you don't share the code used to generate them, at least in numerical physics, your research is not reproducible, and therefore you are just slowing down research as a global machine. Of course you might think that this gives you a competitive advantage, because other people cannot take your code and run similar simulations; but if you share your code with everyone else, they will start talking to you, they will ask for your expertise, they will discuss with you, and in the end what happens, at least in my experience, is that you just create bigger and bigger collaborations, you meet new people, you work with more people. Since you are all young and probably started your PhD in the last few years: it is very important to be open about this. We can be 100% reproducible in this field of research, at least for purely computational physics, so let's try to be; at least that's what I would like us to be. Nowadays I'm in Paris at École Polytechnique, and for anyone who is finishing their PhD, or is bored of their postdoc, or simply needs to find a new one: we are looking for a postdoc. Some of my people are down there; if you want to know whether I whip people or hit people, ask them, they will tell you it's absolutely not true, right? Okay, so with that, very briefly, what am I going to discuss?
Machine learning for quantum physics is this funny little community that keeps growing, and there are many blobs inside of it; these at least are some that we identified the last time we met. You may have seen with Elisica some analysis and detection on quantum data: you collect data from a quantum computer or from some experiment, you apply some machine learning techniques, and you try to extract more information. Later on, I think, you will see something about optimal control, controlling experiments or doing reinforcement learning on real devices. There are several lines of research about using machine learning to improve the compilation of quantum algorithms, but that is not what I am discussing today. Today I will discuss ab initio quantum simulation, which means essentially: I have a quantum system, so someone tells me there's an electron, there's another electron, there's a third electron, the electrons interact through the Coulomb force, and what happens? There is no data whatsoever; for everything I am going to tell you today there is no data. You can try to think in terms of what the data set would be, to find bridges to machine learning, but there is no data, at least not in the standard sense. I am just trying to understand what happens, what the structure is, what the solution of a quantum system is, so I am pretty much trying to do what people have done for a very long time; actually all the techniques I am going to present today are quite old, I am just trying to present them in a somewhat more modern fashion and to draw some bridges towards what is done in machine learning. Just a few references very quickly: everyone knows the book by Goodfellow and Bengio, which is not too bad but is starting to be a bit outdated; I really like the other book, by Murphy, on Probabilistic Machine Learning, which contains several interesting tidbits and is a bit more up-to-date, I think.
Quite a bit of the material in today's talk comes from the book, or lecture notes, Modern Applications of Machine Learning in Quantum Sciences, which exists only thanks to the sheer willpower of Anna, who is somewhere around here I think, down there, don't hide, and which originated from a school a bit like this one organized in Warsaw a few years ago. I will mention a few things about automatic differentiation later on; for a different, more mathematical picture that lets you understand it a bit better, I strongly suggest you read the notes by Michael Betancourt, who during his PhD more or less invented neural differential equations: a geometric theory of higher-order automatic differentiation. Chapters two and three deal with higher-order automatic differentiation, very complicated things, but chapter one is just three pages long and is a very clean introduction to what it really means, mathematically, to do automatic differentiation, what the fundamental operations are, and how to reason computationally and algorithmically about what can and cannot be done. Finally, code-wise, I am a strong JAX advocate, so you cannot ask me much about PyTorch; I really like Julia, but it is very hard to do the kind of things I do with Julia nowadays, so I really push everyone towards JAX. On Friday there will be an interactive tutorial using some notebooks we developed, also based on JAX. In the spirit of open source and sharing everything, we developed a code called NetKet, which implements pretty much most of the algorithms we have worked on in the last years, and there is a GitHub repository collecting the notebooks from my lectures, to which I will add something for Friday. You will find a lot more material there, so you can also have a look at some point at things that will not be discussed in the lecture.

Okay, I hope now you're awake, so let me give a very brief overview of what I am trying to solve. I will go very fast, I just want to settle on some notation, and I will say something super obvious now just so everyone agrees; please interrupt me at any time during this lecture, I will be happy to answer questions. Mainly, for today, I will be discussing spin systems, because they are simple. For a single spin we can say that the wave function, psi, is an object that lives in the Hilbert space, which I will write down as the collection of all complex vectors that are linear combinations of some basis elements, in this case up and down. I will write B(H), not for bounded operators but for the basis of the Hilbert space, which in this case is simply the set {up, down}; these are the two basis elements I have, and the dimension of the Hilbert space is essentially the number of elements in the basis, which in this case is simply two. This is all very simple. In my notation the wave function (I could be more mathematical about it, but I will be quick and sloppy because we don't need the details) is simply an object that takes elements from this basis (actually anything, but in particular elements of this basis) and gives us a complex number, which means essentially that you feed it, say, the up configuration and you get ⟨up|psi⟩ = psi(up).
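Since I will keep using this picture of the wave function as a function you evaluate, here is a tiny illustrative sketch in Python (the ±1 labels for up and down are just my convention for this example):

```python
# A single-spin state |psi> = psi(up)|up> + psi(down)|down>, stored as its two amplitudes.
amplitudes = {+1: 0.6 + 0.0j,   # psi(up)
              -1: 0.0 + 0.8j}   # psi(down)

def psi(x: int) -> complex:
    """Evaluate <x|psi> = psi(x) for a basis element x in {+1 (up), -1 (down)}."""
    return amplitudes[x]

print(psi(+1))   # (0.6+0j)
print(psi(-1))   # 0.8j
```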
I know this is the dual and so on; let's just be very, very sloppy about this, fine. You can always use this to insert an identity and decompose the wave function: the identity is nothing but a sum of the projectors onto all the basis elements, so I can write |psi⟩ = Σ_x psi(x) |x⟩, where x runs over the basis elements; for a single spin this is simply psi(up)|up⟩ + psi(down)|down⟩. Is this clear for everyone? Okay. So you see that I can write my wave function either with this bracket, ⟨x|psi⟩, or really as a function evaluation: I feed it one of those basis elements, like up or down, and it gives me an output. And mathematically I can always let up and down correspond to some numbers, like 0 and 1, or -1 and +1; I can always take a representative. Is this clear? Yes, good.

This is, let's say, one-body physics; I actually want to talk about many-body physics, so I will take a Hilbert space that is a tensor product over many particles. In this case the Hilbert space is given by linear combinations like alpha |up up ... up⟩ + alpha' |up up ... down⟩ + ..., and likewise the basis of this many-body Hilbert space runs from all spins up, through configurations with one spin down, all the way down to all spins down. The dimension of this Hilbert space is of course exponential, 2^N. So if I unpack my wave function in this basis, what I get looks like |psi⟩ = Σ_{x1...xN} psi(x1, ..., xN) |x1, ..., xN⟩. Clear? For the rest of the talk I will again be a bit sloppy and just write this as Σ_x psi(x)|x⟩, where technically x is a vector, but in practice I will never write the arrow; just remember it. The sum therefore runs over exponentially many entries, so doing this sum is expensive.

Now, if I take my basis not as an unordered set but as an ordered set (this is the first element of my basis, this is the second, this is the last one), I have a one-to-one correspondence between an integer (0, 1, 2, 3 or 1, 2, 3, 4, depending on how you count) and the basis elements, so I can always store the wave function in a computer as a vector of complex numbers: psi(up up ... up), psi(up up ... down), and so on down to psi(down down ... down). This vector will be exponentially large. If I do 20 spins, 2^20 amplitudes is about 16 megabytes of memory: it fits in my laptop easily, no problem. If I do 30 spins, that takes 16 gigabytes, which I think is all the memory I have in my laptop. And if I want to do 40, let's say 45 spins... I had written it down: the largest supercomputer today has about five petabytes of RAM, so we can probably do about 50 spins by filling up all the memory of the largest supercomputer on Earth. You might go to a policymaker and tell him that you have a solution for 51 spins: you just need to build a second supercomputer. Okay, now you have two supercomputers, you can do 51 spins; and the answer for 52 spins is that you need two more supercomputers. That is the quantum many-body problem: we cannot store this wave function. We need to compress it down somehow into something that is manageable, or we need to make approximations that in a sense truncate away the quantumness of the problem, or that leverage some of our physical understanding of the system, so that we don't have to deal with such a large object in memory.
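To make the scaling concrete, here is a minimal back-of-the-envelope sketch in Python; it only assumes dense complex amplitudes of 16 bytes each (complex128), i.e. the naive storage just described.

```python
import numpy as np

def state_vector_bytes(n_spins: int, bytes_per_amplitude: int = 16) -> int:
    """Memory needed to store psi as a dense vector of 2**n complex128 amplitudes."""
    return (2 ** n_spins) * bytes_per_amplitude

for n in (20, 30, 40, 50):
    print(f"{n} spins -> {state_vector_bytes(n) / 2**30:.1f} GiB")
# 20 spins ->        0.0 GiB   (about 16 MB: fits on a laptop)
# 30 spins ->       16.0 GiB   (all the memory of a laptop)
# 40 spins ->    16384.0 GiB
# 50 spins -> 16777216.0 GiB   (petabytes of RAM)

psi = np.zeros(2**20, dtype=np.complex128)   # fine
# np.zeros(2**50, dtype=np.complex128)       # hopeless: it will never fit in memory
```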
So this is the memory-complexity aspect of the quantum many-body problem. However, it is not the only face of it: I like to think of the quantum many-body problem as a coin with two faces. One is the memory complexity, and the other one is the runtime complexity, or computational complexity. What I described is purely a storage problem; imagine for a moment that I have a black box that can store a lot of data, like in any of your favourite sci-fi movies, and I can query this box: give me this entry, give me the hundredth entry, give me the last entry. But how do we compute expectation values in quantum mechanics? Expectation values are computed by taking the bracket ⟨psi|H|psi⟩, and in terms of those configurations, of what I store in my computer (again, this might be very obvious for most of you), this is just the double sum Σ_{x,y} psi(x)* ⟨x|H|y⟩ psi(y), which arises from inserting an identity here and here; I will forget the denominator ⟨psi|psi⟩ for a moment, it is there as well. So I have three sums, and if you did not forget that I have this fancy notation for summing over N binary variables, it means that those sums run over exponentially many entries: here I have two sums that each run over 2^N entries. So even if I have a black box that solves the memory aspect of the quantum many-body problem, I still have to compute things, and computing things will take exponentially many operations. Even if you have a big supercomputer that can do it in a reasonable time, you add another spin and you need twice the amount of time, or twice the number of supercomputers, and eventually this breaks down; you cannot solve the problem efficiently this way. So we do not only need to solve the memory-complexity aspect, to find a way to compress the wave function: we also need to find a way to compute quantities efficiently. Essentially, I will be discussing those two aspects, and it is important to keep this duality in mind, because there are some algorithms, like the tensor networks I think you will see this afternoon, that can address both problems at the same time; however, there are more general compression schemes, like the neural quantum states I will be discussing today, that only solve the memory aspect, and then we need another algorithm to solve the runtime complexity.

Good, so how do we solve the memory problem? What we do is very simple: instead of storing this very large set of complex numbers, we do a little trick (no, I don't have colours; actually I do have colours, it's fine). Instead of storing this large set of numbers, I will store a vector theta, where theta lives in some space W, which I will call my variational space, or space of variational parameters, defined such that the dimension of W is much smaller than the exponentially large dimension of the Hilbert space. And now I simply add a subscript theta everywhere, so the semantics of everything do not change, but instead of storing all those numbers, 2^N complex numbers, I just store a small set of complex parameters, theta.
I can then just recompute those entries whenever I need them: I don't store all the entries, I store a compressed version of them. As in this sketch, you can really think of it as compression, and evaluating the wave function for one entry is just some sort of decompression. Now, if this variational space is very small, if I have a small number of parameters, or a polynomial number of parameters rather than an exponentially large one, then I have compressed my wave function, I can store it, and I am happy. Is it clear? I will discuss later what we can actually use to compress. Mathematically, one thing I like to say is that in a sense we have a map that goes from W, the space of variational parameters, to the Hilbert space: it takes one theta, which is a vector (but again I will generally not write it as a vector, otherwise I would just be putting arrows everywhere, and that's boring), and associates to it the wave function |psi_theta⟩, and this mathematical object is then the same as always. If you like differential geometry you can think of it in terms of manifolds and maps; I will not discuss this too much, but there is a lot to it. And again, even if I do this, even if I put little thetas everywhere, you can see very well that I do not solve the runtime problem: there are still exponentially many entries, and this issue is still there.

So let me give you a few examples of what we can do to compress. This kind of compression, in the historical jargon of physicists, is a variational ansatz (and I think you have to put an umlaut for the plural, Ansätze, but I always get confused; the Germans can correct me, please). The simplest variational ansatz I can think of, if we can even call it an ansatz, is some sort of mean field, a mean-field projection: instead of storing the full wave function, I say that psi_theta(x1, ..., xN) is just a product psi_theta1(x1) psi_theta2(x2) ... psi_thetaN(xN). Every one of those factors is a wave function for a single particle, which is two complex numbers, and I simply take the product, so in total I have two times N complex numbers to store, and I can always evaluate it efficiently: I just query this entry, this entry, this entry, and take the product. What happened, of course, is that this cannot represent all states: if you know something about quantum mechanics and entanglement, you will recognize that this is a product state and that it cannot encode any quantum correlation. It is a brutal approximation. It turns out that if you plug this ansatz into the expectation-value expression, you can also compute those sums automatically, because everything factorizes and becomes much easier; so you actually solve both issues, but of course the approximation is brutal, as in the sketch below.
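As a concrete illustration, here is a minimal sketch of this factorized ansatz in JAX; the parametrization (one two-component complex vector per spin, with spins labelled 0/1 just for indexing) is my own choice for the example, not a standard interface.

```python
import jax.numpy as jnp

def logpsi_mean_field(theta, x):
    """log psi_theta(x) for a product (mean-field) state.

    theta: complex array of shape (N, 2), one single-spin wave function per site.
    x:     integer array of shape (N,) with entries 0 (up) or 1 (down).
    """
    per_site = theta[jnp.arange(theta.shape[0]), x]   # psi_{theta_i}(x_i) for every site i
    return jnp.sum(jnp.log(per_site))                 # log of the product of single-spin amplitudes

# Only 2*N complex parameters instead of 2**N amplitudes,
# but it is a product state: no entanglement can be encoded.
```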
So we want to go a bit beyond this, because I want to study systems with quantum correlations. What you will see this afternoon is a generalization of this approach; I will only write down the matrix product state, or MPS for short, which belongs to the larger family of tensor networks. These correspond to writing your wave function as a contraction of several tensors, in Einstein-like notation: psi(x1, ..., xN) = Σ_{i1 ... iN} A^{x1}_{i1 i2} A^{x2}_{i2 i3} ... A^{xN}_{iN i1}. So A is a three-leg tensor, not a vector, not a matrix, but a rank-three tensor: I index it with x1, which is either up or down, and the other two legs can be as large as you want; they are what encodes the quantum correlations between different sites. If you take those legs i1, i2 and so on to have dimension one, you essentially go back to the product state. This is very powerful: it turns out, for several theoretical reasons that are very interesting, that this is probably the most efficient structure you can use for any problem that is one-dimensional, at least for the ground state. But in general, what do we do if we have a two-dimensional system, a three-dimensional system, or some chemical molecule where it is very hard to define a precise structure? (Is that a question? No.)

There are many other ansätze that have been developed over the years; I will cut it short and say that at some point, in 2017, Giuseppe Carleo and Matthias Troyer had the nice idea to say: instead of something complicated that requires a lot of physical knowledge about the system, why don't we just take a neural network and call it a day? Essentially they said: why don't we take, as the wave-function ansatz, literally the exponential of the output of a neural network, psi_theta(x) = exp(f_theta(x))? (Actually, you can also leave the exponential away; it's a technicality.) That is the famous Science paper: the idea is simply to say that my variational ansatz is the output of a neural network, that's it, done. Now, of course, there are a lot of details, and the questions we started asking ourselves later were: which neural network should we pick, should we use restricted Boltzmann machines, feed-forward networks, convolutional networks, transformers, graph networks, and so on; how do I encode some physical knowledge about the system into those networks; how do I make it antisymmetric under permutations if I want to do fermions? Those are a lot of details, very important and very interesting, that we keep asking ourselves essentially every day. But the underlying idea is simply: the wave function is complicated, and if I don't know how to approximate it, I can just try to approximate it with a neural network. Why? Because a neural network is a universal function approximator. This is a sketch of a feed-forward network; consider for a moment a network with just one hidden layer (I assume you know how to read these kinds of sketches). It turns out that if you take an infinitely wide network, so of course infinitely many parameters, or rather in the limit where you keep increasing the width, you can approximate an arbitrary function better and better: that is the universal approximation theorem. Let's say you have some f*, some target function you want to approximate, which would be my target wave function, and f_w is my psi_theta, the wave function encoded as a neural network: I can increase the width, and if I find the good parameters I can represent it. Of course, I need to find the good parameters, and that will be the rest of this lecture.
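To make the idea concrete, here is a minimal JAX sketch of such an ansatz, with the log-amplitude given by a one-hidden-layer network, so psi_theta(x) = exp(f_theta(x)). The architecture, shapes and names are illustrative choices of mine, not those of the original paper; with real parameters as written, the amplitudes are positive, so in practice one also needs to encode a phase, for instance with complex parameters.

```python
import jax
import jax.numpy as jnp

def init_params(key, n_spins, n_hidden):
    k1, k2 = jax.random.split(key)
    return {
        "W": 0.01 * jax.random.normal(k1, (n_hidden, n_spins)),
        "b": jnp.zeros(n_hidden),
        "v": 0.01 * jax.random.normal(k2, (n_hidden,)),
    }

def logpsi(params, x):
    """log psi_theta(x) = f_theta(x); x is an array of +-1 spins."""
    h = jnp.tanh(params["W"] @ x + params["b"])   # one hidden layer
    return jnp.dot(params["v"], h)                # scalar output: the log-amplitude

params = init_params(jax.random.PRNGKey(0), n_spins=10, n_hidden=16)
x = jnp.ones(10)                                  # the all-up configuration
print(logpsi(params, x))                          # an unnormalized log-amplitude
```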
Now, this is for one layer, one hidden layer, and it was proven already by Cybenko in '89, and by several others later on. More recently, people started asking why deep learning is so powerful, why deep networks work so well, and here it is problematic: there are no hard results, only many conjectures, at least the last time I checked (maybe I am outdated, because that was a few years ago, but I have not heard of any huge award being given out, so I think it is still the case). There are conjectures saying that if I send the depth to infinity while keeping the width of my network constant (so every time I increase the depth I am adding some constant number of parameters, not multiplying it), then the error of the best approximation goes down exponentially fast. You see why this is interesting: the one-hidden-layer results are not very useful in practice, they are just motivational, foundational results that tell us why this approach should work at all. With a one-hidden-layer network, if I double the number of parameters, the error of my best approximation should go down polynomially; instead, with a deep network, if I add another layer, so not even doubling the number of parameters, the error goes down exponentially, so much faster. This is not proven in general, but there is a lot of evidence that it is the case, and there are tentative proofs in some edge cases that are completely useless in practice but actually interesting from a mathematical point of view; you can read about them in the paper by Lu and co-workers from NeurIPS 2017, which is highly cited by now, and very beautiful if you like math, because it is a bit intense. That's the idea, that's why deep learning is in theory powerful: you can approximate things much better. So eventually we will want to do the same, even if in the field of neural quantum states, for the first three years, we were using one-hidden-layer networks for most of our papers, and it was okay.

Before delving into the other aspects of how we compute quantities with neural quantum states, let me say one very quick thing: what happens if I encode my wave function this way? Well, if there is no assumption on the architecture of the neural network and no assumption on the weights, the network will be spitting out random numbers. Now, the beauty of the Hilbert space is that it is just a vector space, so any set of random numbers is a valid vector in the Hilbert space, meaning that a neural network encoding a wave function will always be encoding a physical wave function. Maybe it is not the one you are looking for, and it is not normalized, but it is always a valid state. The reason I mention this is that if, for some reason, you are a bit masochistic like me and you want to study open quantum systems and encode density matrices, this is much harder, because you cannot allow an arbitrary density matrix: you want it to be Hermitian and you want it to be positive semi-definite. But for closed systems and a Hilbert space, anything goes.
There was then this, again, aspirational result by Or Sharir in 2021, together with Giuseppe and others, that essentially proved mathematically that if we take feed-forward networks, very general, very arbitrary neural networks, they are able to encode the ground states of all gapped one-dimensional Hamiltonians, which is what you can represent very well with tensor networks, with MPS, with matrix product states; it proved that neural quantum states, under very wide assumptions, can always represent matrix product states exactly. Again, I am not talking about the details of the architecture, I will discuss them later, but you can take a feed-forward network with one or two layers. They also proved that PEPS, projected entangled pair states, which are some sort of generalization of matrix product states to two dimensions, cannot be represented exactly but can be approximated with exponential accuracy by an arbitrary neural quantum state. So, in a sense, they sketched this diagram where neural quantum states are, in principle, more general. Again, I am not saying that you should use neural quantum states for one-dimensional systems for anything other than benchmarking, because tensor networks are extraordinary for that kind of problem; but for problems where you cannot use them, this tends to be fair game. Then there were several results about the fact that neural quantum states can encode string-bond states, some topological states and their ground states, and that they can represent volume-law states; there are a lot of results, to this day, about what we can represent. There are not many results about what we cannot represent, what a neural network cannot encode, and this, I think, is the next frontier: understanding the limits of those techniques. I will not talk about it, because we have very few results, not zero but very few, but there are very interesting questions to be asked about what the limits are, and maybe you will be working on them in a few years.

So, for the rest of this hour, every time I write a neural network psi_theta, it is just an arbitrary network, okay? It could be convolutional, it could be feed-forward, but let's keep it general for now, and let's instead discuss how we can solve the runtime-complexity problem and compute expectation values efficiently. In a sense, the problem is that the Hilbert space is too large and I have sums over the full Hilbert space. Either there is some magical property, like for tensor networks or product states, where in a sense I can truncate away huge chunks of the Hilbert space, so that I don't need to do the sum over everything and can just ignore whole blobs of the Hilbert space, or I need to do something else. What I will try to do is rewrite this expression in a form where, instead of summing over the full Hilbert space, I sum only over some carefully chosen configurations x of the Hilbert space; essentially, I will do some Monte Carlo sampling to select the most important configurations, and evaluate some sort of energy estimator only on those configurations.

So how do we do that? Again, I start from the expression ⟨psi_theta|H|psi_theta⟩ / ⟨psi_theta|psi_theta⟩. What I can do is insert an identity, as I have done there, and I get Σ_x ⟨psi_theta|x⟩⟨x|H|psi_theta⟩ / ⟨psi_theta|psi_theta⟩; I have basically done nothing. Then I can take another colour and do something that might look very stupid,
but which is the foundation of all the calculations we do all the time: I multiply and divide by psi_theta(x). Why do I do that? Because I want to extract a probability distribution: you see, this coefficient is complex, but if I multiply it by its complex conjugate I get a probability, the Born amplitude. So, collecting terms, I get something that looks like Σ_x [ |psi_theta(x)|² / ⟨psi_theta|psi_theta⟩ ] × [ ⟨x|H|psi_theta⟩ / ⟨x|psi_theta⟩ ]. Now there is a little caveat: I divided by psi_theta(x), which might be zero, and the mathematical police might come looking for me. It turns out I am allowed to do it. Why? Because psi_theta(x) already appears in this expression, so if psi_theta(x) were equal to zero, the whole term would be zero. In a sense, with another colour, I can say that I sum over all x such that psi_theta(x) is not zero, because the other blob, the sum over the x where psi_theta(x) is zero, contains a factor psi_theta(x) = 0 in the numerator, so it vanishes; therefore I am allowed to do this. Remember this point, because later on I will do the same thing and I will not be allowed.

Good, so I do this and I get this expression. This blob here is the Born probability: it is a real number, it is non-negative, and it sums up to one; do you all agree, is it clear to everyone? Yes, good. I will often call it p_theta(x) = |psi_theta(x)|² / ⟨psi_theta|psi_theta⟩, and I can also write the energy as an expectation value, over x drawn from p_theta(x), of the quantity ⟨x|H|psi_theta⟩ / ⟨x|psi_theta⟩. This thing inside, I will call it H_loc,theta(x); this is just a historical name, it is called the local energy, or local estimator, in a sense because I am sampling a probability distribution, I get configurations x from the Hilbert space according to their probability amplitude, and then I can compute an estimator that is local in the Hilbert space, that depends only on this x and on the parameters theta, and that tells me how much this configuration contributes to the energy. Yes?

Now, this is an expectation value. If I want to know the average age of the students in this room, I can ask every one of you, and that is exact; or I can sample some well-chosen representatives of you and take the average among them. This is not an exact calculation anymore, I will not know the exact average age, but I will have a good guess, and the more people I ask, the better this approximation will be. So why don't I do the same here: why don't I pick configurations drawn according to the Born amplitude, compute the local estimator only for those configurations and not for all of them, and, since I have an exponential problem, take only polynomially many of them, a number I can actually deal with? Yes. That is what happens in practice: I approximate this expectation value with its sample mean, or sample average, which means I choose a certain number of samples N_s and write ⟨H⟩ ≈ (1/N_s) Σ_{i=1}^{N_s} H_loc,theta(x_i), where the x_i are distributed according to this probability distribution. Okay, is this clear? I see some confused heads.
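In code, the estimator is just a sample mean; here is a minimal sketch where `local_energy` and `samples` are placeholders (how to compute the local energy and how to draw the samples is exactly what comes next):

```python
import jax.numpy as jnp

def estimate_energy(local_energy, samples):
    """<H> ~ (1/N_s) * sum_i E_loc(x_i), with the x_i drawn from |psi_theta|^2."""
    e_loc = jnp.array([local_energy(x) for x in samples])
    return jnp.mean(e_loc)   # the statistical error shrinks as the number of samples grows
```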
So, you see, I started from a sum over exponentially many entries and I end up with a sum over a few entries, and I am happy. Now, this all works, and it is polynomial, only if H_loc is also polynomial time, if I can compute H_loc in a short amount of time and not in exponential time. (I don't have blackboards anymore, let's take this one.) If you remember, up there, there were two sums, I inserted two identities: I solved the first one, but I still have a second one to deal with. H_loc(x), which I defined as ⟨x|H|psi_theta⟩ / ⟨x|psi_theta⟩, has a denominator that is fine, it is just a query of my neural network, so I am happy; but the numerator is a bit more tricky, because there I can, and should, insert another identity, and I get Σ_y ⟨x|H|y⟩ ⟨y|psi_theta⟩ / ⟨x|psi_theta⟩. How many entries are in this sum? Exponentially many, right? So I did not solve the problem.

Actually, I did, because it turns out that the Hamiltonian is sparse, it is k-local, for most of the problems we are interested in. Let me give you an example: take as the Hamiltonian H = Σ_i sigma_x^i, the sum of a transverse field on every site, and say I did some sampling and got the configuration x = up up up up. If I know that a matrix element is zero, I can drop it out of the sum, so I can restrict the sum to the y such that ⟨x|H|y⟩ is not zero, and this is fine, correct? So which matrix elements are not zero? Very few of them, because what does sigma_x do? It simply flips one spin. If I ask which y make ⟨up up ... up| sigma_x^i |y⟩ non-zero, then for each i there is only one: the configuration where I flip the i-th spin. So the sum collapses to a sum over the configurations y obtained from x by flipping a single spin. I call those the connected elements of the matrix: given an operator, a matrix H, I take a configuration in the basis of my Hilbert space (all up, or something else I sampled or generated, I don't care), and I ask which are the connected entries, meaning the configurations that H connects it to, such that the matrix element is not zero. In general there are polynomially many of them, as long as your Hamiltonian is physically reasonable, which means you have local interactions, few-body interactions, one-, two-, three-, four-body; if you go to eight-body operators, of course, you start to have a lot of connected entries. In general you get one such entry for every Pauli operator in your Hamiltonian; this is very similar to what happens in quantum computing. So, for this particular Hamiltonian, the sum over y is simply a sum over the configurations where I flip one spin, so I have only order-N non-zero entries; I query my wave function for those modified configurations, I recombine them with the matrix elements accordingly, and I am done. Is it clear?

In a sense, computationally speaking, if I want to build some code or some algorithm that computes this (you will see it on Friday), what I need is simply a black box, a function, an object to which you feed a configuration of your Hilbert space and which tells you the connected components, the configurations for which the matrix element of the Hamiltonian is not zero, and also the values of those matrix elements, because you have to put them in here. If you play with NetKet you will see that we have exactly one such function, called get_conn (get connected elements), which gives you those objects; it is a very good building block to implement this kind of algorithm, and something you should write yourself if you are implementing these algorithms from scratch, as in the sketch below. Are there any questions?
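Here is a minimal sketch of that building block and of the local energy for the transverse-field example above, written from scratch in JAX; `get_conn_sigmax` only plays the role of a get_conn-style function, it is not NetKet's actual API, and `logpsi` can be any log-amplitude function (for instance the network sketch from before).

```python
import jax.numpy as jnp

def get_conn_sigmax(x):
    """Connected configurations y with <x|H|y> != 0, and their matrix elements,
    for the example Hamiltonian H = sum_i sigma_x^i (spins stored as +-1)."""
    n = x.shape[0]
    ys = jnp.tile(x, (n, 1)).at[jnp.arange(n), jnp.arange(n)].multiply(-1)  # row i = x with spin i flipped
    mels = jnp.ones(n)                                                      # <x|sigma_x^i|y_i> = 1
    return ys, mels

def local_energy(logpsi, params, x):
    """E_loc(x) = sum_y <x|H|y> psi_theta(y) / psi_theta(x), via log-amplitudes."""
    ys, mels = get_conn_sigmax(x)
    logpsi_x = logpsi(params, x)
    logpsi_y = jnp.array([logpsi(params, y) for y in ys])
    return jnp.sum(mels * jnp.exp(logpsi_y - logpsi_x))
```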
Good. So, in a sense, until now I told you that we can compress the wave function down to something small by using a variational representation, and variational representations with a polynomial number of parameters are efficient. I have also assumed, I didn't say it explicitly, that when we evaluate those bit strings, those configurations, the evaluation of the network is efficient, meaning polynomial time and not exponential time; of course, if it took an exponential amount of time just to evaluate one configuration, I would not be discussing it, but this is also part of what "efficient" means. Then I said that to compute expectation values, or any bracket, I can always take those expressions and rewrite them as classical expectation values, over some probability distribution (in this case the Born probability amplitude), of some local estimator; and the local estimator, in turn, is efficient to compute because my Hamiltonian is physical, because it only has few-body operators. Yes?

[Question from the audience about the imaginary part of the local energy.] What is the expectation value of the imaginary part? Yes, exactly: since you already know the answer, you can just truncate it, and it can actually be used to tell how far off your algorithm, your sampling, is, because if you see a huge imaginary part it is a warning bell that something is off; you don't know what, but something is off. So it is handy to check it, but indeed you know it is zero, so you usually just truncate it. Any more questions? Okay.

One thing I did not discuss yet is how we can generate the Born samples. By the way, Born samples are very easy to generate, in a sense, on a quantum computer, because if instead of a neural network I had a variational quantum circuit, or some quantum circuit, the Born samples are just the bit strings I get out of it. But how can I generate them starting from a neural network? If my model is generative, I can just sample from it; if my model has tractable marginals, I can generate configurations from its probability distribution directly; but in general my model is not generative, the marginals are not tractable, and I do not know how to generate samples distributed according to this. Why? Because it is not normalized: my neural network here is completely general, it is not normalized to one, so if I evaluate it on a configuration it might give me 10,000 or 100,000, I don't know. The reason this is an expectation value at all is that I divided by ⟨psi_theta|psi_theta⟩, by its normalization factor, and computing this normalization factor involves a sum over the full Hilbert space, over all the basis elements, which is exponentially large, and I cannot do that. So, one thing I would like you to remember, if you just wake up for ten seconds and then you can go back to sleep, is that every time you have those objects, the way we construct sampling estimators is always by looking at
what we cannot compute, and making it disappear by shoving it inside the probability distribution. This denominator ⟨psi_theta|psi_theta⟩: I do not know how to evaluate it, I do not know how to estimate it, so I simply construct a probability distribution for which it is the normalization, because then, when I write things as an expectation value over that probability distribution, the probability, with its normalization, disappears. Is it clear, or was it too fast? It is a guiding principle: if you ever want to write a stochastic estimator for some crazy quantity, ask yourself the question, what can I not compute because it is too hard, and make it disappear with the sampling. Good, you can go back to sleep now.

Okay, so now I want to quickly say something about how we generate samples drawn from a distribution, assuming my neural network is completely arbitrary, or not generative, for those who like machine learning language. How do we do that? We do Markov chain Monte Carlo sampling with a Metropolis-Hastings acceptance rule. If you already know what this is, raise your hand; do I need to show it? Yes, okay, then I will quickly show it, and the others can go back to their Twitter feed or whatever for five minutes; let me get my notes. So the question is: how can I generate a number of samples drawn from a probability distribution p(x)? I want to generate a set of samples x_0, x_1, ..., x_N (I am sorry for the Julia people about starting to count from zero, really, I feel for you; and I should stop at N minus one, so I am really making a mess). How do we do that? We can start from what is known as the detailed balance condition; this is the starting point from which we can derive the algorithm, and the idea is the following: configurations distributed according to this probability distribution should respect the detailed balance condition. The detailed balance condition tells you, in a sense, that if you have a set of samples and you change them, the way you change them does not change the overall probability distribution; it is some sort of equilibrium condition for a non-equilibrium system, meaning there is macroscopic equilibrium even though changes are still happening at the microscopic level. The condition says that the probability to be in configuration x_t, times the probability that x_t becomes x_{t+1} at the next step, should be equal to the probability to have configuration x_{t+1} times the probability to get x_t given x_{t+1}: p(x_t) T(x_{t+1}|x_t) = p(x_{t+1}) T(x_t|x_{t+1}). Essentially it is telling you that the probability that I take one random configuration from my distribution and change it into the next one should be equal to the probability of having that next configuration and going back, some sort of reversal symmetry. This is a foundational result in statistical physics, it is very beautiful, and there is a lot to say about it. Now I can start to play a bit with the math: dividing, the ratio of the transition probabilities must satisfy T(x_{t+1}|x_t) / T(x_t|x_{t+1}) = p(x_{t+1}) / p(x_t). And now what I want is to find a way to construct this
chain of configurations: if I have one configuration at time t, I want to be able to generate the one at t+1. How do I do that? I kind of know this probability distribution, or at least its ratios, but I do not know this T object, so I make a guess for it: I will say that T(x_{t+1}|x_t) can be factored into two components, T(x_{t+1}|x_t) = g(x_{t+1}|x_t) A(x_{t+1}|x_t), where g is the probability to propose a new configuration and A is the probability to accept it. In a sense, my algorithm will be something like: I have a configuration, I propose a new one, and maybe I accept it, and then it becomes my next configuration, or I reject it and just keep the old one, and I go on building my chain. Now, if I plug this definition into the detailed balance condition, I get A(x_{t+1}|x_t) / A(x_t|x_{t+1}) = [p(x_{t+1}) / p(x_t)] × [g(x_t|x_{t+1}) / g(x_{t+1}|x_t)]; it is simply math, I just shuffled terms around. Now, say my proposal rule g is simply to pick one spin at random and flip it: given a configuration x_t, with probability 1/N I flip the first spin, with probability 1/N the second one, with probability 1/N the third one, and so on. Is it clear? This defines g, and it also defines the reverse g, which is essentially the same, always 1/N, so for symmetric rules like this one the g ratio cancels out. I still need to find what the acceptance probability A should be, and whatever mathematical form I choose for A must still respect this detailed balance ratio. What can I do? Well, Hastings proposed, a very long time ago, one expression for it. It is not the only one, there are many others, but it is the one that is used the most, and later on a lot of mathematical motivation was given for why you should use it: the probability to accept the configuration x_{t+1}, given the current configuration x_t, is A(x_{t+1}|x_t) = min(1, [p(x_{t+1}) / p(x_t)] × [g(x_t|x_{t+1}) / g(x_{t+1}|x_t)]), where the g ratio is usually just one if your proposal is symmetric. Why does this satisfy the condition? Simply because, if the argument is smaller than one, then this side equals it, the two factors simplify, and the reverse acceptance is one; and if it is larger than one, its inverse is smaller than one, so it is the denominator that equals one. I am going a bit fast because I want to cover other things, but if you think about it for a moment, or ask me afterwards, it really is just this little game with the two cases. And if you use this acceptance probability, you respect detailed balance, and so you have an algorithm, which I will spell out in a second.
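Before spelling out the algorithm, here is a tiny numerical sanity check, purely illustrative, that the min(1, p(x')/p(x)) acceptance does satisfy detailed balance when the proposal is symmetric (a toy four-state distribution, nothing physical):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(4)
p /= p.sum()          # an arbitrary normalized distribution over 4 states
g = 1.0 / 3.0         # symmetric proposal: pick one of the other 3 states uniformly

for x in range(4):
    for y in range(4):
        if x == y:
            continue
        A_xy = min(1.0, p[y] / p[x])        # accept x -> y
        A_yx = min(1.0, p[x] / p[y])        # accept y -> x
        assert np.isclose(p[x] * g * A_xy, p[y] * g * A_yx)   # p(x) T(y|x) = p(y) T(x|y)
print("detailed balance holds for every pair of states")
```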
Step one: you start from a completely random configuration. Step two: you use the proposal rule g that you chose to propose a new configuration; the simplest thing, you just flip a random spin. Step three: the g ratio is one, so you just compute the probability amplitude for the new configuration and for the old one, you check whether their ratio is larger or smaller than one, and you accept with that probability. If it is larger than one you always accept, meaning that if you move to a configuration with more probability you always accept it; if you move to a configuration with lower probability, you might accept it or not, with a probability given by that ratio. And what is interesting is that if you do this many times, you build a chain of configurations that are distributed according to your distribution, of course (wow, a strong Italian accent: of course).

Now, there are many, many details about this; that was the theory, and in practice you must endure a huge amount of pain to make it work. Say I have a configuration of spins, I flip a random one, and I accept the new configuration: now x_1 and x_2 are two configurations that differ by just one spin flip. Are they correlated? Yes, they are not independent samples; the correlation time of this Markov chain will be quite large, so you need to decorrelate them. Usually you do not do just one flip, you do a lot: you apply this algorithm many times to generate two samples that are not really decorrelated, but that look as if they were. This is called sweeping, and a sweep of the Markov chain Monte Carlo algorithm usually consists of running this elementary step many times, taking many steps. So in practice you will generate a series of configurations, x_0, x_1 and so on along the Markov time, and you will not use all of those configurations for your Monte Carlo estimates, because they are correlated. Usually we discard the first part, because we initialize the Markov chain from some completely random configuration that is probably not distributed according to the target distribution; then we take one entry, and then another, and then another, at constant intervals: we sub-sample the chain. Is it clear? This is done to decorrelate the samples. Another very important thing to remember is that, depending on your Hilbert space, you may have constraints: say you want to perform your calculation in a sector with well-defined magnetization, where for every spin up there is a spin down. Then you need to make sure that your proposal respects this condition: if you want your calculation to stay in the constant-magnetization manifold, where spins up and spins down are in equal number, you always want to propose new configurations that still respect it; or, I don't know, if you want your particles to stay inside a box, you must always propose configurations that are inside the box, otherwise you are in trouble. A sketch of the whole procedure is below.
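Putting the pieces together, here is a minimal sketch of the whole procedure in JAX (single-spin-flip proposals, burn-in, one kept configuration per sweep); `logpsi` is any log-amplitude function, and the sweep and burn-in numbers are illustrative, not tuned.

```python
import jax
import jax.numpy as jnp

def metropolis_chain(logpsi, params, n_spins, n_samples, key,
                     sweep_size=None, n_burnin=100):
    sweep_size = sweep_size or n_spins          # one sweep ~ n_spins proposed flips
    key, sub = jax.random.split(key)
    x = 2 * jax.random.bernoulli(sub, 0.5, (n_spins,)).astype(jnp.float32) - 1
    logp = 2 * jnp.real(logpsi(params, x))      # log |psi(x)|^2, up to the unknown normalization

    samples = []
    for step in range((n_burnin + n_samples) * sweep_size):
        key, k1, k2 = jax.random.split(key, 3)
        i = jax.random.randint(k1, (), 0, n_spins)      # pick a random spin
        x_new = x.at[i].multiply(-1)                    # propose flipping it
        logp_new = 2 * jnp.real(logpsi(params, x_new))
        # accept with probability min(1, |psi(x_new)|^2 / |psi(x)|^2)
        if jnp.log(jax.random.uniform(k2)) < logp_new - logp:
            x, logp = x_new, logp_new
        # keep one configuration per sweep, after discarding the burn-in
        if step % sweep_size == 0 and step >= n_burnin * sweep_size:
            samples.append(x)
    return jnp.stack(samples)
```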
Now, why does this work? Because here I only ever take ratios of probabilities, and in a ratio the normalization factor cancels out, the normalization factor that I do not know how to compute. Yes, that is the secret ingredient: because I take a ratio of probabilities, I do not need to know, in absolute terms, the value of the probability of a particular configuration; I just need to know, in relative terms, which one is more likely. Okay? This is what saves me. Finally, questions?

[Question about the optimization of theta.] No, no, I'm talking about everything else; the next thing is the optimization of theta. You have theta and you want to compute the energy; sorry, I will get there in a second, but the idea is that this is the energy, so I can either scan the full space of parameters, draw a very nice geographic map of the energy landscape and find the minima, which is very inefficient if I have millions of parameters, or I can take the energy, compute the gradient, and follow the gradient downwards. And that is again a connection with machine learning: minimizing the energy is what physicists call the variational principle, the ground-state energy-minimization variational principle, and what people in machine learning call gradient descent, minimizing a loss function; the energy is your loss function. And at every step I will do this Monte Carlo estimate for a different value of theta. Are there any questions about Monte Carlo sampling? If not, the last thing; we have a break at 10:30, right? Yes.

Okay, so the last thing before the coffee, and I hope you are awake by now, is to answer this question a bit better. Until now you have, let's say, a way such that, if someone tells you "trust me, those weights correspond to the state you are looking for", the ground state, or the first excited state, or a state at time t, then you have a way to compute expectation values and properties of the system; you do not yet know how to find those states. So let's compute the gradient of the energy; I think I have a slide, so I can go faster for this. Okay, very quickly, the variational principle: please raise your hand if you have doubts about it and don't believe me. I will just discuss one particular algorithm, how to find the ground state; see it as: I am showing you this particular algorithm and how we derive it in this case, but try to grasp the bigger picture, because you can always use the same ideas to derive pretty much all the other algorithms and find the other states we are interested in. The variational principle, in physics terms, says that the energy as a function of the parameters, w or theta or whatever, is always larger than or equal to the ground-state energy; I think this should not surprise anyone, and if it does, please raise your hand. So, if I am a machine learning person, I don't call it energy: I can just tell my machine learning friend that this is the loss function, the energy is the loss function, and I minimize it. How do we minimize it? We need the gradient. Conceptually, what happens is that I initialize my neural network with some random parameters theta; if I look at how my state overlaps with all the eigenstates of my Hamiltonian, I will have some completely random overlap with the different eigenstates, and I am very high up in the energy spectrum. Then I compute the gradient (I will tell you how in a second), I follow it down, I start to increase my overlap with the ground state and diminish it with the high-energy ones, and eventually, if everything goes well, which beware does not always happen, I converge to the ground state, or to something that really looks a lot like it. The outer loop, schematically, is sketched below.
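Schematically, the outer loop looks like this; `estimate_energy_and_gradient` is a placeholder, since how to build the gradient estimator is exactly what we derive next.

```python
import jax

def ground_state_search(params, estimate_energy_and_gradient, key,
                        n_steps=200, learning_rate=0.01):
    for step in range(n_steps):
        key, subkey = jax.random.split(key)
        energy, grad = estimate_energy_and_gradient(params, subkey)   # Monte Carlo estimates
        # plain gradient descent: the energy plays the role of the loss function
        params = jax.tree_util.tree_map(lambda p, g: p - learning_rate * g, params, grad)
        if step % 20 == 0:
            print(step, energy)
    return params
```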
Of course, most of the time my neural network is not so powerful that it can represent the ground state exactly, but it can represent something that is very close to it, and because of the variational principle I kind of know that the lower the energy, the closer I am getting to the target state, to the ground state. Yes, amazing. Okay, and please don't ask me what happens if you are on the other side of this well and you get stuck in a local minimum; machine learning people will tell you that, in general, if you have a very high-dimensional function, not one dimension but a lot of parameters, the probability of having a local minimum goes down. How? Has anyone heard this? Exponentially: if the curvature along each direction behaves like a random variable, then to have a local minimum you need the function to curve upwards along all directions at the same time, and this is exponentially unlikely. So in general, when you have very high-dimensional functions to minimize, it will be very unlikely to have local minima; you will instead have saddle points, which are equally bad, or maybe just epsilon better: the function goes up along one direction, but the other directions are flatter and go slightly down. There is a whole set of problems about this, but in general: local minima, not a problem; saddle points, big problem.

Okay, how do we compute the ground state? Or rather, how do we compute the gradient, sorry; this is what we want. Let me fix some notation: I will write the gradient as D_theta, and D_{theta_i}, or D_i, for the partial derivative with respect to theta_i; so D_theta is really a vector of partial derivatives, but I will never put the vector signs. Now, if you remember, I said before that psi is a function that takes an element of the basis as input; the psi I want to differentiate now also takes another input that I did not discuss before: it takes a set of parameters and a configuration in the basis of the Hilbert space, a pair (theta, x), and gives me a complex number, psi_theta(x). This is a complex-valued function, and there is a little detail to be discussed about what happens when I take derivatives of complex functions, because it is not so straightforward. There are two cases that matter, and this is very badly written almost everywhere, so pay attention for a second. First case: we can assume that the weights are real, theta in R^M. Then I might have to take derivatives like D_theta of psi_theta(x)*: the output is complex, but theta is real. Psi_theta is not a holomorphic function, because the weights are real, it is basically a real-to-complex function, but this derivative is always well defined. If you want to understand this a bit better: people are obsessed with holomorphicity, but that is not actually the important thing; what we should care about is what is called CR calculus, and it is discussed in the notes on the CR calculus by Kreutz-Delgado on the arXiv. It is a very long document but it reads very easily; just read the first chapter, which is about single-variable calculus, the rest is the same thing for the multivariate case. In a sense, what he wants to tell you is that this object is always well defined, and that we have this identity:
So this is one way to do the calculation, assuming real parameters. The other way is to assume that my parameters θ are complex. Now a single gradient is no longer well defined, because there are two independent derivatives, one with respect to θ and one with respect to θ*, and this is where holomorphicity and all those issues come in. So whenever I do the calculation assuming complex θ, I will also assume that ψ is holomorphic with respect to θ. What does holomorphic mean? It means it satisfies the Cauchy-Riemann conditions, which everyone likes to write in a very complicated form; I much prefer the equivalent, simpler form $\partial_{\theta^*}\,\psi_\theta(x) = 0$. If you do not believe this is the same thing, think about it for a moment or come ask me; essentially it means that your wave function, your neural network, does not depend on θ*. So if you never use θ* in your network, your network is holomorphic; and θ* sneaks in every time you take a real part, an imaginary part, or a modulus squared. If you avoid the modulus squared, the real part and the imaginary part, your network is holomorphic and you can do the calculation this way. If your wave function is not holomorphic, everything is more complicated, but you can always imagine splitting your parameters into real and imaginary parts: you had N complex parameters, now you have 2N real ones, and you are back to the first case. So essentially there are only two ways to do this calculation that matter: real parameters, or complex parameters with a holomorphic wave function. I will do the calculation assuming holomorphicity because it is much simpler; you can redo it yourself with real parameters and you will find essentially the same formula with an extra real part in front, which is the version you find in most books; I will use this one. With that said, under holomorphicity we have $\partial_{\theta^*}\,\psi_\theta(x) = 0$, and that is all we will need.
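And here is the same kind of toy check for the complex-parameter case, again with hypothetical stand-in functions: the Wirtinger derivative with respect to θ* can be built from derivatives along the real and imaginary axes, it vanishes (numerically) for a toy log-amplitude that never uses θ*, and it stops vanishing as soon as a modulus appears.

```python
import numpy as np

x = np.array([1.0, 1.0])

# Holomorphic toy amplitude: uses theta only, never theta*, abs, real or imag.
def log_psi_holo(theta):
    return theta[0] * x[0] * x[1] + theta[1] * (x[0] + x[1])

# Non-holomorphic variant: the modulus sneaks a theta* dependence in.
def log_psi_nonholo(theta):
    return np.abs(theta[0]) * x[0] * x[1] + theta[1] * (x[0] + x[1])

def wirtinger_conj_derivative(f, theta, k, eps=1e-6):
    # d f / d theta_k^* = 1/2 (d/da + i d/db), with theta_k = a + i b.
    e = np.zeros_like(theta); e[k] = 1.0
    dda = (f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
    ddb = (f(theta + 1j * eps * e) - f(theta - 1j * eps * e)) / (2 * eps)
    return 0.5 * (dda + 1j * ddb)

theta = np.array([0.3 + 0.1j, -0.2 + 0.05j])

print(wirtinger_conj_derivative(log_psi_holo, theta, 0))     # ~ 0: Cauchy-Riemann holds
print(wirtinger_conj_derivative(log_psi_nonholo, theta, 0))  # != 0: not holomorphic
```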
Is this clear to everyone? Yes. Second thing: if I have complex parameters and I want to minimize the loss function, I need to follow not the gradient but the conjugate gradient. Is this clear, or do you want to know why? No? OK, let's see why. E is a real-valued function: it goes from complex parameters to a real number, so $E(\theta+\delta\theta) - E(\theta)$ is a real number, and it is how my energy changes if I update my parameters by δθ. Expanding to first order, $E(\theta+\delta\theta) - E(\theta) = \delta\theta\,\partial_\theta E(\theta) + \delta\theta^*\,\partial_{\theta^*}E(\theta) + O(|\delta\theta|^2)$. Do you agree? Both derivatives appear because E has a real output, so it is not holomorphic, and I must keep both. Now, E is real, so $\partial_\theta E(\theta) = \big(\partial_{\theta^*}E(\theta)\big)^*$, the two first-order terms are complex conjugates of each other, and the change is $2\,\mathrm{Re}\!\big[\delta\theta\,\partial_\theta E(\theta)\big]$. Take the modulus and ask: for which direction δθ, of unit modulus, does this function change the most? I can always bound $\big|2\,\mathrm{Re}[\delta\theta\,\partial_\theta E(\theta)]\big| \le 2\,|\delta\theta|\,|\partial_\theta E(\theta)|$, and which δθ saturates this inequality? The one for which the argument of the real part is real, its imaginary part is zero, and that happens when $\delta\theta \propto \big(\partial_\theta E(\theta)\big)^* = \partial_{\theta^*}E(\theta)$. So this proves that if you want to minimize a real-valued function of complex parameters, the steepest direction is the conjugate gradient $\partial_{\theta^*}E$, and to go downhill you step along minus the conjugate gradient, not the plain gradient. And you do not need two independent gradients: because the output is real, the direction to follow is fully specified by 2N real numbers, two per parameter, which you can encode as a single complex vector. This is exactly what Kreutz-Delgado discusses in the CR calculus: when you have a real-valued loss function you do not care about holomorphicity, you only care about the CR-calculus analogue of analyticity, the fact that you can still define a well-behaved gradient with two real components per parameter, still encoded as a complex number. But it is a good point; it is just not the issue here. OK, so if you believe this calculation, it tells you that with complex parameters I want to follow the conjugate gradient of the energy, and that is what I now want to compute. Applying the quotient rule to $E(\theta) = \langle\psi_\theta|H|\psi_\theta\rangle/\langle\psi_\theta|\psi_\theta\rangle$, I get
$$\partial_{\theta^*}E(\theta) = \frac{\partial_{\theta^*}\langle\psi_\theta|\,H\,|\psi_\theta\rangle}{\langle\psi_\theta|\psi_\theta\rangle} + \frac{\langle\psi_\theta|\,H\,\partial_{\theta^*}|\psi_\theta\rangle}{\langle\psi_\theta|\psi_\theta\rangle} - E(\theta)\,\frac{\partial_{\theta^*}\langle\psi_\theta|\psi_\theta\rangle}{\langle\psi_\theta|\psi_\theta\rangle}.$$
OK, yes? Now you remember why I brought up that condition: $\partial_{\theta^*}|\psi_\theta\rangle = 0$ because the ansatz is holomorphic, so the middle term is zero. In a sense I am treating θ and θ* as independent variables: the ket $|\psi_\theta\rangle$ depends only on θ, while the bra $\langle\psi_\theta|$ is its conjugate, so it depends on θ* and its derivative with respect to θ* is not zero. When I bring the derivative inside the conjugation it loses the star, $\partial_{\theta^*}\langle\psi_\theta| = \langle\partial_\theta\psi_\theta|$, so the first term becomes $\langle\partial_\theta\psi_\theta|H|\psi_\theta\rangle/\langle\psi_\theta|\psi_\theta\rangle$. For the last term I pull the same trick; and no, nothing is missing there: the derivative of the denominator gives numerator over denominator squared, which is what reassembles into the factor $E(\theta)$. So in the end
$$\partial_{\theta^*}E(\theta) = \frac{\langle\partial_\theta\psi_\theta|H|\psi_\theta\rangle}{\langle\psi_\theta|\psi_\theta\rangle} - E(\theta)\,\frac{\langle\partial_\theta\psi_\theta|\psi_\theta\rangle}{\langle\psi_\theta|\psi_\theta\rangle}.$$
Yes, yes? OK, and if you did the same calculation without assuming holomorphicity, with real parameters, you would get the same thing with a real part in front. So now I have brackets again. Those brackets, if I insert the identity inside them, contain sums over the full Hilbert space; they are very hard and very expensive to compute, and I do not want to compute them exactly. Instead I want to estimate them with Monte Carlo sampling, so I need to pull the same trick as before and rewrite them as expectation values over the Born probability distribution. This is what I will quickly do now before we take a break; I think we are not over time, right? Yeah, come on, just one minute. I have two terms to compute.
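Before turning those brackets into expectation values, here is a minimal sketch of what evaluating the formula exactly looks like on a two-spin toy problem with a hypothetical holomorphic log-amplitude (all names invented for illustration). Every sum below runs over the full Hilbert space, which is exactly what becomes unaffordable beyond a handful of sites.

```python
import numpy as np

# Two spins: the full Hilbert space has only 4 configurations, so exact sums are fine here.
states = np.array([[s0, s1] for s0 in (-1, 1) for s1 in (-1, 1)], dtype=float)

def log_psi(theta, x):        # log psi_theta(x), uses theta only (holomorphic toy)
    return theta[0] * x[0] * x[1] + theta[1] * (x[0] + x[1])

def d_log_psi(theta, x):      # holomorphic derivative with respect to theta
    return np.array([x[0] * x[1], x[0] + x[1]], dtype=complex)

# Transverse-field Ising Hamiltonian on 2 sites, written in the z basis.
J, h = 1.0, 0.5
H = np.zeros((4, 4))
for i, x in enumerate(states):
    H[i, i] = -J * x[0] * x[1]                      # -J sz_0 sz_1
    for site in range(2):
        y = x.copy(); y[site] *= -1                 # single spin flip
        j = int(np.where((states == y).all(axis=1))[0][0])
        H[i, j] = -h                                # -h sx_site

theta = np.array([0.3 + 0.1j, -0.2 + 0.05j])
psi = np.exp(np.array([log_psi(theta, x) for x in states]))              # <x|psi_theta>
dpsi = np.array([d_log_psi(theta, x) for x in states]) * psi[:, None]    # d psi / d theta_k

norm = np.vdot(psi, psi)
E = np.vdot(psi, H @ psi) / norm

# Conjugate gradient from the formula above, via exact full-Hilbert-space sums:
grad = np.array([np.vdot(dpsi[:, k], H @ psi) / norm
                 - E * np.vdot(dpsi[:, k], psi) / norm for k in range(2)])
print(E.real, grad)
```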
Let me do the second term, which is the easier one, first: $\langle\partial_\theta\psi_\theta|\psi_\theta\rangle/\langle\psi_\theta|\psi_\theta\rangle$. If you remember, I said that this denominator, $\langle\psi_\theta|\psi_\theta\rangle$, is an object I do not know how to compute, so it is what I want to make disappear by absorbing it into a probability distribution. So I insert an identity and get $\sum_x \big(\partial_\theta\psi_\theta(x)\big)^*\,\psi_\theta(x)\,/\,\langle\psi_\theta|\psi_\theta\rangle$, and then I pull the same trick as before: I multiply and divide by $\psi_\theta(x)^*$, so that a $|\psi_\theta(x)|^2$ appears and, together with the denominator, becomes a probability distribution. I get $\sum_x p_\theta(x)\,\big(\partial_\theta\psi_\theta(x)\big)^*/\psi_\theta(x)^*$. Clear? And this is just the expectation value, over x sampled from the Born distribution, of the conjugated log-derivative $\big(\partial_\theta\log\psi_\theta(x)\big)^*$, because $\partial\psi^*$ divided by $\psi^*$ is the derivative of the log. Do you agree? Is it clear? Good. So to estimate this object I just draw some samples and compute the log-derivative at those samples, not everywhere, only at the sampled configurations. And this manipulation is legitimate: I am allowed to divide by $\psi_\theta(x)^*$ here because a factor $\psi_\theta(x)$ was already sitting in the sum, so nothing blows up where the wave function vanishes. OK. And now, the last thing before the break: the first term, $\langle\partial_\theta\psi_\theta|H|\psi_\theta\rangle/\langle\psi_\theta|\psi_\theta\rangle$. Again, let me insert an identity and write it as $\sum_x \big(\partial_\theta\psi_\theta(x)\big)^*\,\langle x|H|\psi_\theta\rangle\,/\,\langle\psi_\theta|\psi_\theta\rangle$. Now I want to pull the same trick again; however, here there is no $|\psi_\theta(x)|^2$ ready to be factored out, so I will do something horrible: I will simply multiply by $|\psi_\theta(x)|^2$ and divide by $\psi_\theta(x)^*$ here and by $\langle x|\psi_\theta\rangle$ there. Please note that, strictly speaking, I am not allowed to do that; the math police should lock me up. Yet we do it all the time. It is wrong, and it does cause issues: in particular, it is wrong whenever the wave function has nodes, points where it vanishes while its derivative does not. We are aware of it, kind of, and we accept it. (Question from the audience: you could have inserted the identity on the other side, where a $\psi_\theta(x)$ is already present and the trick would be legitimate. Yes, it is true, you are allowed to, and we discussed it in a recent paper, but it does not work as well, for reasons I will come back to later.) So this is what we do, even if it is slightly wrong. And again, we now have a probability distribution, so this term is the expectation value over $p_\theta$ of $\big(\partial_\theta\log\psi_\theta(x)\big)^*$ times $\langle x|H|\psi_\theta\rangle/\langle x|\psi_\theta\rangle$, and this last object is my local energy, $E_{\rm loc}(x)$. OK. So if I put those two pieces together, the gradient of the energy with respect to θ* is the expectation value, over x drawn from the Born distribution, of $\big(\partial_\theta\log\psi_\theta(x)\big)^*$ times $E_{\rm loc}(x)$ minus the expectation value of $E_{\rm loc}$. I went a bit fast because I am out of time, so please bear with me: the conjugated log-derivative here is the same object I had in the easier term; that easier term was multiplying the energy $E(\theta)$, and the energy is nothing but the expectation value of $E_{\rm loc}$, so what I end up with is $E_{\rm loc}(x)$ minus its own expectation value,
$$\partial_{\theta^*}E(\theta) = \mathbb{E}_{x\sim p_\theta}\Big[\big(\partial_\theta\log\psi_\theta(x)\big)^*\,\big(E_{\rm loc}(x) - \mathbb{E}_{x\sim p_\theta}[E_{\rm loc}]\big)\Big].$$
Does it make sense? No? Why? Someone? OK. Now, this looks a lot like a covariance: I have something of the form $\mathbb{E}[a\,(b - \mathbb{E}[b])]$, and you can show very easily that this is also equal to $\mathbb{E}[(a - \mathbb{E}[a])(b - \mathbb{E}[b])]$; if you do not believe me, just do the quick calculation. And this is what we usually call the covariance of a and b, OK?
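Here is the Monte Carlo version of the same computation on that two-spin toy, with the same hypothetical ansatz as in the earlier sketch. Since the Hilbert space is tiny, the samples are drawn exactly from the Born distribution instead of with a Metropolis chain, but the estimator is the one just derived: the conjugated log-derivative times the centered local energy, averaged over samples.

```python
import numpy as np

rng = np.random.default_rng(0)

states = np.array([[s0, s1] for s0 in (-1, 1) for s1 in (-1, 1)], dtype=float)

def log_psi(theta, x):
    return theta[0] * x[0] * x[1] + theta[1] * (x[0] + x[1])

def d_log_psi(theta, x):
    return np.array([x[0] * x[1], x[0] + x[1]], dtype=complex)

J, h = 1.0, 0.5
H = np.zeros((4, 4))
for i, x in enumerate(states):
    H[i, i] = -J * x[0] * x[1]
    for site in range(2):
        y = x.copy(); y[site] *= -1
        j = int(np.where((states == y).all(axis=1))[0][0])
        H[i, j] = -h

theta = np.array([0.3 + 0.1j, -0.2 + 0.05j])
psi = np.exp(np.array([log_psi(theta, x) for x in states]))
p = np.abs(psi) ** 2 / np.sum(np.abs(psi) ** 2)          # Born distribution

def e_loc(i):
    # E_loc(x) = <x|H|psi> / <x|psi>, needs only one row of H and psi at a few points.
    return (H[i] @ psi) / psi[i]

idx = rng.choice(4, size=5000, p=p)                       # x ~ |psi_theta(x)|^2
O = np.array([d_log_psi(theta, states[i]) for i in idx])  # log-derivatives at the samples
E_loc = np.array([e_loc(i) for i in idx])

energy = np.mean(E_loc)                                               # estimate of E(theta)
grad = np.mean(np.conj(O) * (E_loc - energy)[:, None], axis=0)        # conjugate gradient
print(energy, grad)
```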
So I can always say that the gradient, or better yet the conjugate gradient, is the covariance, sampled from the Born probability distribution, of $\big(\partial_\theta\log\psi_\theta(x)\big)^*$ and $E_{\rm loc}(x)$, OK? And the fact that this is a covariance is a very beautiful property, because it makes the estimator very resistant to sampling noise; it has a very low variance. It is also why, if you had inserted the identity on the other side and obtained the other expression for this quantity, the one where you are legitimately allowed to multiply and divide by $\psi_\theta(x)$, you would not end up with a covariance, and you would get another estimator that, while correct, so unbiased, has much worse noise properties. But to conclude, the last thing I want to say: this gradient, the one we actually estimate, is beautiful, and if you show it to a computer scientist they should be very happy, for one particular reason. In machine learning, when you take a data set and run gradient descent down to the bottom, you know very well that if you do not stop training at some point you start overfitting your data and you get very bad results. Here, instead, if I converge to the ground state, $E_{\rm loc}(x)$ no longer depends on x: $\langle x|H|\psi_\theta\rangle/\langle x|\psi_\theta\rangle$ is a constant if $\psi_\theta$ is the ground state, and indeed if it is any eigenstate. Then $E_{\rm loc}$ equals the energy of my state, the centered term is zero, and my gradient becomes zero. So if I hit any eigenstate, the gradient vanishes, I stop optimizing, and I sit on the solution, something that does not happen in machine learning. This is a very beautiful property. It comes from the fact that we do not really have a data set: we resample our data, our configurations, at every step from the current distribution, and so once we converge the gradient becomes zero and we stop. This also has other implications, namely that the variance, the noise of this estimator, becomes smaller and smaller as we converge, the so-called zero-variance property, and that gives us other nice features I will come back to later. So let's have a break. Sorry.
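Finally, a tiny check of that last point, again on the same two-spin toy Hamiltonian (a hypothetical example, not the speaker's own code): if you plug in the exact ground state, the local energy is literally the same number for every configuration, so the centered estimator above vanishes term by term and the optimization would simply sit still.

```python
import numpy as np

# Two-site transverse-field Ising model in the z basis, as in the sketches above.
states = np.array([[s0, s1] for s0 in (-1, 1) for s1 in (-1, 1)], dtype=float)
J, h = 1.0, 0.5
H = np.zeros((4, 4))
for i, x in enumerate(states):
    H[i, i] = -J * x[0] * x[1]
    for site in range(2):
        y = x.copy(); y[site] *= -1
        j = int(np.where((states == y).all(axis=1))[0][0])
        H[i, j] = -h

# Use the exact ground state instead of a variational ansatz.
evals, evecs = np.linalg.eigh(H)
E0, psi0 = evals[0], evecs[:, 0]

# On an exact eigenstate, E_loc(x) is the same number for every configuration,
# so E_loc(x) - <E_loc> = 0 sample by sample and the gradient estimator vanishes.
E_loc = (H @ psi0) / psi0
print(E_loc)                      # four identical values
print(np.allclose(E_loc, E0))     # True
```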