The next speaker is Agnes Valenti, who will be talking about the optimization landscape in neural network quantum states.

Thank you so much for the introduction, and thanks a lot to the organizers for having me; I'm excited to be in Trieste, very beautiful weather. Okay, I'm going to talk about how to understand a bit better what goes on during training when we use neural network quantum states, but I want to say this is very much work in progress. We are not claiming to have any answers yet, but I thought it would be nice to talk about it anyway, in order to maybe have a productive discussion about this topic. This work is done together with a bunch of collaborators, and I think half of them are sitting in the audience right now, so if you have any questions you can also talk to them.

So let's start. The previous talk already gave a very nice introduction to neural network quantum states, which makes my life a bit easier, but I'm going to reintroduce them anyway to make sure that everybody is on the same page. What's the problem we're considering? We want to solve a many-body problem; in particular, here we focus on the time-independent Schrödinger equation. We have a Hamiltonian and we want to find the ground-state wave function of this Hamiltonian and the associated energy, and maybe some excited states, but for now let's focus on the ground state. The problem is that this object, the wave function, scales exponentially with system size, so we all know that if we go to larger lattice systems we cannot solve the system exactly anymore and we run into problems.

Luckily this is not the end of the story, otherwise we would all only build quantum computers. We can look at it in the following way: within this huge exponential Hilbert space, we are not looking to represent just any state, but rather a more or less specific sub-manifold of the Hilbert space which corresponds to the ground states of physical Hamiltonians. Physical means maybe local, or maybe a specific subset of systems that we're interested in. The hope is that for this sub-manifold we can find some sort of parametrization that does not scale exponentially with system size, and then we can make use of the variational principle to optimize this parametrization and find the ground-state wave function. This idea has been around for a long time and people have worked on it for a long time; it basically corresponds to the whole field of variational methods. Here we're going to focus on variational Monte Carlo, where we essentially have no restriction on what sort of parametrization we use, since we estimate all relevant expectation values with Monte Carlo sampling.

A not-so-recent-anymore development was to say: we can use a parametrization of the wave function that is as generic as possible. If we think about the history of variational Monte Carlo, the field started out with very physically motivated wave functions, such as Slater-Jastrow or Gutzwiller-projected wave functions, which offer great physical insight but, used on their own, also come with limited variational freedom. There is another class of variational methods, not using Monte Carlo, namely tensor networks; while they are extremely powerful, we know that their limitations are related to entanglement.
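To state explicitly the variational principle being invoked here (this is the standard textbook form, nothing specific to this work):

```latex
E(\theta) \;=\; \frac{\langle \psi_\theta | H | \psi_\theta \rangle}
                     {\langle \psi_\theta | \psi_\theta \rangle}
\;\ge\; E_0 \quad \text{for every } \theta ,
\qquad
\theta^\ast = \operatorname*{arg\,min}_\theta \, E(\theta) .
```

Equality holds only if the trial state is an exact ground state, so minimizing E(θ) over the parameters drives the ansatz toward the ground state.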
So the idea here was that if we use a very generic parametrization, a neural network, as our variational ansatz, maybe we're not as restricted and can go to systems that we cannot address with other methods. How does it work? We have a neural network that takes as input a spin configuration (I'm only talking about spin systems here) and outputs a potentially complex amplitude, which then goes into our wave function, since we expand it in some basis, for instance the basis of our spins.

Now we can relate this problem to classical machine learning and reframe it a bit: what we actually do is minimize a cost function, and the cost function is just the expectation value of the Hamiltonian. In every step we estimate this cost function using a finite number of samples, our Monte Carlo samples. If we want, we can say that's our training data. It's not a one-to-one correspondence, because we generate this training data on the go from our neural network, but we can anyway ask whether there are connections to classical machine learning.

Now, neural network quantum states have had great successes on hard problems where other methods fail. However, hard problems are still hard problems, and neural network quantum states face limitations there too. Here are some examples of more or less recent results of neural network quantum states on frustrated systems. What's typically observed is that while you do get state-of-the-art accuracy, the accuracy at the highly frustrated point is typically quite a bit lower than away from frustration. Similarly, looking at other systems, it is not always guaranteed that your neural network wave function will find the correct ground state to, let's say, machine precision. So you can ask: what is it that can go wrong during optimization?

Firstly, it could be that our ansatz is just not expressive enough, meaning that it spans some sub-manifold of the Hilbert space that does not contain the ground state. If that's the case, you optimize and optimize but end up at a finite distance from the ground state; maybe you just do not have the correct ansatz. Secondly, the ansatz may be expressive enough, but you get stuck somewhere: maybe your optimization landscape is somehow weird, or you don't see it correctly during training because of the finite number of samples, and you don't manage to get out, even though your ansatz in principle could capture the ground state. That would correspond to issues with trainability. There has been quite a bit of work on examining where the limitations of these methods lie, and the answer so far is: maybe everywhere.
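To make the cost function above concrete, here is a minimal sketch of the Monte Carlo energy estimate; `log_psi` and `hamiltonian_connections` are hypothetical stand-ins for the network's log-amplitude and for the Hamiltonian's connected matrix elements, not part of any particular library:

```python
import numpy as np

def local_energy(sample, log_psi, hamiltonian_connections):
    """E_loc(s) = sum_{s'} <s|H|s'> psi(s') / psi(s).

    `hamiltonian_connections(s)` (hypothetical helper) returns the
    configurations s' connected to s and the matrix elements <s|H|s'>;
    `log_psi` maps a batch of configurations to complex log-amplitudes.
    """
    connected, mat_els = hamiltonian_connections(sample)
    ratios = np.exp(log_psi(connected) - log_psi(sample[None, :]))
    return np.sum(mat_els * ratios)

def energy_estimate(samples, log_psi, hamiltonian_connections):
    """Monte Carlo estimate of <H>: the mean of the local energies
    over configurations drawn from |psi|^2 (the 'training data')."""
    e_loc = np.array([local_energy(s, log_psi, hamiltonian_connections)
                      for s in samples])
    return e_loc.mean(), e_loc.std() / np.sqrt(len(e_loc))
```

The finite batch of samples is exactly where the analogy to a training set enters: every iteration sees only a stochastic estimate of the true cost.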
Here are three main papers from that body of work. The first paper looked at reconstructing quantum states of frustrated systems, not by minimizing the ground-state energy but by directly minimizing the infidelity with respect to the exact ground state, for small systems. What they found is that the network struggles to generalize to the correct ground state if too few samples are used; they actually found a critical number of samples needed to generalize to the correct state. Then another paper found strong indications that problems also arise from a landscape that is not smooth but rugged in a particular way. And thirdly, a recent paper claims that one of the problems may also lie in expressivity: what they see is that if they increase the number of parameters a lot, the effective expressivity in terms of the Hilbert space does not increase. So what does this tell us? It's a hard problem; you cannot find one single answer, but you can try to understand it a bit better and try to bring things together. I'm not claiming that we do that yet, but that's what we're trying to look at.

In particular, we're looking at the following. Let's say we have all of these problems: the expressivity is somehow not good, your landscape is not smooth. What do you do? These are problems that also appear in standard machine learning, and one common answer people have is: more is better. Let's just increase the number of parameters (samples are a bit more complicated) and go to an over-parameterized regime. What you typically see in classical machine learning is the phenomenon they call double descent. In the so-called classical regime, as you increase the number of parameters of your model, your test error and your training error both go down, but at some point you are overfitting and your test error goes up again. If you then continue increasing the number of parameters, you enter a modern interpolating regime, where your test error goes down again; this is where you are truly over-parameterized. Furthermore, people find that if you extremely over-parameterize, your optimization landscape gets quite a bit smoother and it's easier to optimize your system.

For neural network quantum states it's of course not at all clear what this means. First of all, what is the threshold after which we can say we are over-parameterized? Is it the size of your Hilbert space? If that's the case, then we can never over-parameterize, because we wanted to get rid of the exponential scaling in the first place. Or is it maybe a relevant subset of samples that only occurs when you sample from the ground state? That's not clear, but we can try to look at this and understand a bit which regime we are in. Second, there is not only a dependence on the number of parameters but also on the number of samples. If you look at the total number of samples seen during training, then a similar statement as for parameters holds: more is generally better, and you might even see a double-descent phenomenon as a function of the total number of epochs you train your model for. However, if you look at the number of samples seen per iteration, then in classical machine learning batch-size noise typically helps, meaning that training with smaller batches gives you more accurate results in the end. It is not clear how this translates to neural network quantum states: is this also the case, or is more always better here?
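As a side note, the double-descent protocol referred to above is easy to state in code: fix a training set, sweep the model size, and record train and test error. A minimal sketch on synthetic data (not the experiment discussed here; whether the second descent actually appears depends strongly on the model, the data, and the noise):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic regression data standing in for a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = X @ rng.normal(size=10) + 0.3 * rng.normal(size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for width in [2, 8, 32, 128, 512]:  # sweep model size at fixed data
    model = MLPRegressor(hidden_layer_sizes=(width,), max_iter=5000,
                         random_state=0).fit(X_tr, y_tr)
    train_err = np.mean((model.predict(X_tr) - y_tr) ** 2)
    test_err = np.mean((model.predict(X_te) - y_te) ** 2)
    print(f"width={width:4d}  train={train_err:.4f}  test={test_err:.4f}")
```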
So how do we approach our study? We take a set of architectures; there are of course many successful architectures, and we just pick three of them, and we look at a Hamiltonian that is very hard, to try to study a bit what's going on.

The first architecture is the restricted Boltzmann machine (RBM), which was the first architecture used for variational energy optimization with neural network quantum states; we can understand it as a single-layer feed-forward network with a specific activation function and skip connections. Second, we use group convolutional neural networks (GCNNs), because they have been very successful for frustrated models. In a simplistic view, you can understand them as a generalization of convolutional networks to symmetries other than translational symmetry. And third, to have something that behaves a bit differently, we compare with a recurrent neural network (RNN). For an RNN things are qualitatively different, because it has the autoregressive sampling property: you can draw samples directly from the network, and there is no need to construct a Markov chain, so you get a set of uncorrelated samples, which is qualitatively different from the other cases.

Which Hamiltonian do we choose? We went for one of the hardest things you can do, just to really be able to see problems: the antiferromagnetic J1-J2 Heisenberg model on the triangular lattice. Why is it so hard? First of all, there is geometric frustration in the system because of the lattice: in theory there would be a classical Néel order, which you see here with the spins, but it is unstable due to quantum fluctuations. Secondly, there is also frustration induced by the J2 term, which strongly destabilizes this Néel order. So it is frustrated in two ways, which is one of the reasons why this model is hard to solve accurately. We're not actually trying to solve it accurately; we're just using it as a hard test case for our network ansätze.

In general, if you try to solve a problem with a network ansatz, you will be faced by two computational limitations: one is the number of parameters that you can use, and the second is the number of samples, either per batch or in total during optimization. So the first simple step, something most people do when they set up their neural network, is to ask: how does the energy depend on the number of parameters and on the number of samples? For this particular model we use the RBM, the RNN, and the GCNN. You should not compare the absolute values across all three ansätze, only between the RBM and the GCNN, because the RNN is trained on an open system that is a bit smaller, chosen such that the effective Hilbert space size is roughly equivalent, and it is not symmetrized; it is there to show the trend, not the absolute energies.

What do we see? It seems that if you increase the number of parameters (this is a log plot, so you have to be careful with the slopes), you do improve the ground-state energy, but it flattens out quite a bit at large parameter numbers. To get the order of magnitude: the effective Hilbert space size for the model we're looking at, 24 spins with symmetries, is about 50,000, which is about here. So even at this point you already have more parameters than the total Hilbert space size, but the energy error is still at the 10^-3 level. You can do better by including more symmetries and by going to more elaborate ansätze, but if you don't, and just blindly increase the number of parameters, that's what you get.
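For concreteness, the RBM ansatz mentioned above has a closed-form log-amplitude, log ψ(σ) = a·σ + Σ_j log(2 cosh(b_j + (Wσ)_j)), with complex parameters giving complex amplitudes. A minimal NumPy sketch (the sizes and the initialization scale are arbitrary illustrative choices):

```python
import numpy as np

def rbm_log_psi(sigma, a, b, W):
    """log psi(sigma) = a.sigma + sum_j log(2 cosh(b_j + (W sigma)_j));
    sigma is a spin configuration in {-1, +1}^N."""
    theta = b + W @ sigma
    return a @ sigma + np.sum(np.log(2.0 * np.cosh(theta)))

# Example: N = 24 spins, hidden-unit density alpha = 2 (illustrative).
rng = np.random.default_rng(0)
N, M = 24, 48
a = 0.01 * (rng.normal(size=N) + 1j * rng.normal(size=N))
b = 0.01 * (rng.normal(size=M) + 1j * rng.normal(size=M))
W = 0.01 * (rng.normal(size=(M, N)) + 1j * rng.normal(size=(M, N)))
sigma = rng.choice([-1, 1], size=N)
print(rbm_log_psi(sigma, a, b, W))  # one complex log-amplitude
```

The visible bias term a·σ plays the role of the skip connection mentioned above, and log(2 cosh) is the specific activation function.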
So is it really the right solution to just blindly increase the number of parameters and achieve over-parameterization in that way?

Second, we can look at the same plot as a function of the number of samples. What you see here is samples per iteration with the total number of iterations fixed, so more samples means both a larger batch size and more samples seen in total. Interestingly, there is quite a strong dependence on the number of samples for both the RBM and the GCNN; training with more samples seems even more relevant here than increasing the number of parameters. The RNN, however, does not depend much on the number of samples, which we don't have an explanation for; that's just what it seems to be. This can be either good or bad for the RNN: you could say you can train your model with a small number of samples and get similar results, but if you're not able to improve it by adding samples, that might also be a problem.

Now we can compare this to classical machine learning. What's trained here is a three-layer feed-forward neural network on the task of predicting house values on the California housing dataset. The problem with the right-hand-side plot is that we haven't yet made the plot we actually need, which I'm going to explain via the left-hand side. The upper plot is basically exactly the same experiment as on the right-hand side: increasing the batch size while keeping the total number of iterations fixed. So the model sees more samples in total, and also in the classical case the loss goes down and you get a better prediction if you use a larger batch size. The plot we still need to make for the NQS case is the dependence on the batch size when you keep the total number of samples fixed, meaning that with a small batch size you train for more iterations. What you see classically is that mini-batches help: the loss actually goes up if you increase the number of samples per batch. For neural network quantum states we don't know this yet; we would suspect the opposite trend, that mini-batches might not help, but that's something we don't understand yet. It's simple to try; we just didn't do the simulations yet.
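Spelled out, the two protocols being contrasted look like this; `vmc_step` is a hypothetical function performing one parameter update from a freshly drawn batch:

```python
def train_fixed_iterations(params, vmc_step, batch_size, n_iters=1000):
    """Protocol of the plots shown: the iteration count is fixed, so a
    larger batch also means more samples seen in total."""
    for _ in range(n_iters):
        params = vmc_step(params, batch_size)
    return params

def train_fixed_budget(params, vmc_step, batch_size, total_samples=10**6):
    """Protocol still to be run for NQS: the total sample budget is
    fixed, so smaller batches imply more iterations; classically,
    mini-batch noise tends to help in this setting."""
    for _ in range(total_samples // batch_size):
        params = vmc_step(params, batch_size)
    return params
```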
The next thing we can do is go a bit deeper and look at what happens during training: how does the landscape look along the way? Our approach is a bit similar to the one in this paper, but we look at the whole training trajectory and in a slightly different setup. We take many random initializations of our wave function, where many means ten, and just optimize them. They will end up somewhere close to the ground state; they might end up in the same minimum or in different minima, which is something we want to probe and understand, and they might take the same path or different paths through the optimization landscape. So the question we ask is: the moment you have slightly different initializations, do you separate into different parts of the optimization landscape and never see each other again, or is there some coming together at some point?

How do we probe that? We freeze a set of 8,192 samples and feed the same samples through the saved models during optimization. For each checkpoint we save the model at, we obtain 8,192 complex amplitudes, which constitutes one data point; basically this is a snapshot of how the wave function looks. We cannot feed through an exponential number of samples, which is why we freeze a finite set. Then we take all these data points of the different trajectories during optimization, cluster them, and see what the clusters look like. This is for a GCNN, clustered using UMAP, and different trajectories, corresponding to different random seeds, are colored with different colors. What we see is that they tend to cluster in different places; there are some that come together here and here, but generally the trend is to stay separated in the optimization landscape.

There are many questions one can ask here. For instance, how do these states differ physically? Are these clusters maybe separated by different sign structures, and is it perhaps not possible to go from one cluster to the other because it's hard to change the sign structure? Or is it some other physical quantity that separates these clusters? That would be interesting for us to look at, but we haven't done it yet. In general, it seems like it's not a smooth, trivial landscape with one global minimum.

[Audience question.] I would say that if you're truly over-parameterized, and if it's true that for neural network quantum states over-parameterization also means your landscape gets smoother, then this would mean we're not over-parameterized yet. But we also don't know whether over-parameterization really implies a smoother landscape here; this is not really well defined for NQS. So maybe with NQS you just cannot reach the over-parameterized regime, even though this already has about half the Hilbert space size in the number of parameters. We have tried it for the Heisenberg model on the triangular lattice with J2 equal to zero, but we have only tried it for that one model and have no conclusive results yet; that's something we still need to do.

[Audience question: what exactly is being clustered, a fixed sample at fixed parameters?] Oh, I'm putting the frozen samples through the network and getting the wave-function amplitudes, so it's a clustering of wave-function snapshots; my dear collaborator, perfect. All right, good.
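A minimal sketch of the clustering pipeline just described, assuming the complex log-amplitudes on the frozen sample set have been collected for every saved checkpoint; splitting real and imaginary parts into features is an assumption of this sketch (one may also want to fix global phase and normalization first), and `umap-learn` is the library providing UMAP:

```python
import numpy as np
import umap  # pip install umap-learn

def embed_snapshots(log_amps):
    """log_amps: complex array (n_checkpoints, 8192) holding, for each
    saved model, the log-amplitudes on the frozen set of 8192 samples,
    stacked over all random seeds. Returns a 2D embedding; points are
    colored by seed afterwards to reveal per-trajectory clusters."""
    feats = np.concatenate([log_amps.real, log_amps.imag], axis=1)
    return umap.UMAP(n_components=2, random_state=0).fit_transform(feats)
```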
So what we can do next is quantify this a bit more. Things may be separated in the optimization landscape, but how rugged, if you want, does the landscape really look when you're close to the ground state, and what about the final states? For this we can define a quantity; it's not a very good order parameter, but it's the first-order thing to do: let's just look at the variance of the final energies. If your different runs end at final energies that differ a lot from each other, then the variance of the final energies of your optimizations will be higher than if they all go to the same minimum. It's not a perfect order parameter: you can always think of cases it cannot distinguish, say two clusters that are far apart versus many distinct minima within one cluster. But we can do this as a first step, plot the energies, and see which of the cases we're actually looking at; and what we can then further do is look at how this variance depends on the number of parameters and samples.

Before we do that, we have to understand what's happening. If you take your many random initial seeds and just plot where things end up, there is a large cluster somewhere around the ground state, and, for this particular case, a second smaller cluster where I think the runs get stuck in an excited state or something like it. You also see this if you cluster just the final states (now with more points, because we do this for different numbers of parameters and samples), or if you simply scatter the energies for one particular case: there is a larger cluster where things end up close to the ground state, and a smaller cluster above. I'm saying this because, for the ruggedness, we can now distinguish how rugged the landscape is next to the ground state versus in total; we compute the variance for both cases, and this is what you see on the right-hand side, as a function of parameters and as a function of samples.

Honestly, ten random initializations is just too small a number to get good statistics for both clusters together, so ignore the black triangles and let's look at the first cluster, next to the ground state. How rough is the landscape there, and how does it depend on the number of parameters? What we see is that it does smoothen out up to around this point, which interestingly is more or less the Hilbert space size, but beyond that, increasing the number of parameters does not smoothen the landscape much further. Again, the statistics are poor here, and one would have to redo this with more initializations and so on, but it's an indication of the trend we see in this plot. For the number of samples, on the other hand, the landscape does smoothen out, rather unsurprisingly. That makes a lot of sense, because you simply see the landscape better and are less likely to believe you're stuck in a local minimum that is actually not a local minimum. So if you increase the number of samples, the variance of the final energies decreases quite a bit, and the slope seems to just keep going, telling you to increase the number of samples even more. With respect to over-parameterization, you could also say that maybe this is some flat plateau, and you would need to increase the number of parameters further and further until you see a descent again.

To understand a bit why the landscape gets so much smoother with an increasing number of samples, a simple thing to do is to consider how well we actually approximate the gradient in each iteration. We sample configurations and effectively follow the gradient, with some metric on the space of neural network states, so it's important in each step to be able to follow the correct gradient. However, we use a finite number of Monte Carlo samples to estimate that gradient, which will introduce an offset with respect to the exact gradient. Secondly, there has also been a recent paper on a bias in estimating the gradient, which arises because you sample from the wave function and not from the gradient itself: you might imagine that the wave function has a lot of zeros at configurations where the gradient is non-zero, and then you miss these samples during training.
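One way to quantify this, used on the next slide, is an overlap between the sampled and the exact gradient (the exact one is computable by full summation for a 24-spin system). The precise definition is a convention; a natural choice, sketched here, is a plain cosine similarity:

```python
import numpy as np

def gradient_overlap(g_sampled, g_exact):
    """|<g_exact, g_sampled>| / (||g_exact|| ||g_sampled||), in [0, 1];
    1 means the Monte Carlo estimate points exactly along the true
    gradient. Works for complex parameter gradients as well."""
    num = np.abs(np.vdot(g_exact, g_sampled))
    return num / (np.linalg.norm(g_exact) * np.linalg.norm(g_sampled))
```

As the discussion below notes, the same measure could equally be applied to the preconditioned gradient S^{-1} g used in stochastic reconfiguration.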
Here we are not there yet; all we do is look at the sample dependence of the overlap of the sampled gradient with the exact gradient. The smaller the overlap, the worse we're doing. You could imagine that you're close to some minimum and the exact gradient points towards this minimum, but your sampled gradient is so bad that you accidentally go somewhere else, or get stuck moving around. We can look at this as a function of training iterations. This is the same GCNN that I clustered before. At iteration zero I'm plotting the overlap of the sampled gradient with the exact gradient as a function of the number of samples used, and there everything looks good: even for almost no samples, your overlap is basically one, so you move in more or less the right direction. However, once you start training this gets worse. At a hundred iterations you already need quite a sizable fraction of the Hilbert space size to get a reasonably accurate gradient overlap, and this continues if you go to larger training iterations. So you pass through some regime in your training landscape where you suddenly need a lot of samples to accurately estimate your gradient. One way to think of what happens: the variance of the final energies over many random initial seeds is high because you just don't realize that you're actually close to a local minimum, or you think you are in a local minimum when you actually aren't, and there is a cliff next to you that you simply don't see.

Next, there is a second quantity that can tell us a bit more about the structure of the landscape, which is the quantum geometric tensor. What is that? We can understand it in terms of distances in parameter space. Say we have all the neural network parameters for one ansatz and then a second ansatz: what's the distance between them? We could just take the Euclidean distance between the parameter vectors, but this might not directly translate into the distance between the actual wave functions these parameters lead to (the parameters go into your network, which outputs a wave-function amplitude; how do the two states belonging to the two parameter sets differ?). In Hilbert space we do have measures to quantify distances, typically related to the fidelity between two states. One common metric people like to use is the Fubini-Study distance, which is just the arccosine of the (root) fidelity between the two wave functions: if it's low, the fidelity is high and the states are close to each other, and vice versa. With this understanding we can define an infinitesimal metric tensor: we want to translate an infinitesimal variation in parameter space into a variation in Hilbert space, and writing down this infinitesimal variation in Hilbert space gives a metric tensor which we call the quantum geometric tensor, or correlation matrix, or covariance matrix, whatever you like. This is also the same tensor that is used in stochastic reconfiguration when optimizing the wave function, so it is not only relevant for characterizing the landscape.

[Audience comment on stochastic reconfiguration and the gradient comparison.] Yes, right, I agree: what we actually should do is compare not the plain gradient but S^{-1} times the gradient. Absolutely, 100% agree; we just didn't do it yet.
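Written out in one common convention, the distance and the metric just described are:

```latex
d_{\mathrm{FS}}(\psi,\phi)
  = \arccos \sqrt{\frac{|\langle\psi|\phi\rangle|^{2}}
                       {\langle\psi|\psi\rangle\,\langle\phi|\phi\rangle}} ,
\qquad
d_{\mathrm{FS}}\big(\psi_{\theta},\psi_{\theta+\mathrm{d}\theta}\big)^{2}
  = \sum_{k,l} S_{kl}\,\mathrm{d}\theta_{k}^{*}\,\mathrm{d}\theta_{l} ,
\qquad
S_{kl} = \langle O_{k}^{*} O_{l} \rangle
       - \langle O_{k} \rangle^{*} \langle O_{l} \rangle ,
\quad
O_{k}(s) = \partial_{\theta_{k}} \log \psi_{\theta}(s) ,
```

with expectation values taken over configurations sampled from |ψ_θ(s)|². This S is exactly the matrix used as a preconditioner in stochastic reconfiguration.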
So what we can do in the meantime is have a look at this object, the quantum geometric tensor, itself, because it tells us something about the relevant directions in parameter space. One thing we can do is look at the rank of the quantum geometric tensor. What does it mean? The rank tells us how many relevant directions I can take in parameter space that actually lead to a change in my wave function. If it's very small, it means I have a lot of redundant directions in my parametrization, and if I increase the number of parameters I might just be adding redundant directions.

First we can look at the rank as a function of the number of samples. The rank is strictly bounded by the number of samples, so it will always be smaller than or equal to the number of samples used to estimate it. What you see here is the rank divided by the number of parameters, and this diagonal would correspond to the strict bound set by the number of samples. Interestingly, while you do increase the rank with the number of samples, it stays rather flat and remains quite a bit below this strict bound for this model. If you go to larger system sizes, you use a somewhat larger fraction of relevant parameters, but for this particular model, where even the energy is not very good, that does not seem to be the case. One way to think of it: if the rank of the geometric tensor is very small, then you might not see the curvature of your landscape accurately.

Another thing, which has been done in this paper for the case of an RBM on a small system, is to look at the dependence of the rank on the number of parameters. Here we plot the rank against the number of parameters for the GCNN and the RBM, and there seems to be some critical number of parameters at which the rank flattens out, meaning that if you add enough parameters, you don't seem to be adding relevant directions to your wave function. This is not completely clear; one would have to do more studies to see what actually happens when you increase the number of parameters, because it's not a fair comparison. But there seems to be a trend: maybe what happens is that you're able to fully parametrize one part of your wave function, maybe the amplitudes, but if you blindly increase the number of parameters, there is some other part of the physics that you don't capture at all, and this is not the right direction to take. So this basically calls for more study.

[Audience question about the rank.] Yes, for the correlation matrix the number of parameters was larger than the rank of the S matrix, which was around 50,000. Okay, so basically: many open questions. There seems to be a dependence of the rank of the geometric tensor on the number of samples and on the number of parameters, but it's not clear up to which point you can keep adding relevant directions.

[Audience question.] No, this is why I'm also saying this is not complete yet. What we do here is only look at the final state and compare the rank there: if we increase the number of parameters, then we re-optimize and look at the final rank. So to me it's also not clear yet whether this really means that you're not adding relevant directions.
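A minimal sketch of how such a rank estimate is obtained from samples: the Monte Carlo QGT is the covariance matrix of the per-sample log-derivatives, so its rank is automatically bounded by the number of samples, as noted above. The eigenvalue cutoff `tol` is an arbitrary numerical choice:

```python
import numpy as np

def qgt_and_rank(O, tol=1e-10):
    """O: complex array (n_samples, n_params) of log-derivatives
    O_k(s) = d log psi(s) / d theta_k on samples drawn from |psi|^2.
    Builds S_kl = <O_k* O_l> - <O_k>* <O_l> and counts eigenvalues
    above tol; centering makes rank(S) <= n_samples - 1."""
    Oc = O - O.mean(axis=0, keepdims=True)
    S = Oc.conj().T @ Oc / O.shape[0]
    eigvals = np.linalg.eigvalsh(S)  # S is Hermitian and PSD
    return S, int(np.sum(eigvals > tol))
```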
It's hard to make this comparison, right? Because if you take an ansatz and then just add parameters initialized at zero, that might also not be the right way to add directions back, if you know what I mean. So it's a bit of a hard comparison to make. Yes, this seems related.

Okay, I think with this I can come to my summary. There are many question marks here, because we have many open questions. What we looked at is the dependence of certain quantities of the landscape on the number of samples and parameters, and what we saw is that if you increase the number of samples and parameters, the ruggedness of the landscape will in general decrease; trajectories seem to cluster in the landscape; and there is a very crucial dependence of the gradient overlap and of the QGT rank on the number of samples. There remain many open questions, for instance: what about batch size versus epochs? How does the choice of sampling matter; do you need to introduce a different sampling scheme if you want to estimate the gradient, which is not drawn from the wave function itself? It is also not clear whether we are able to reach an over-parameterized regime at all, and what that actually means for neural network quantum states, and how these results scale with larger system sizes, if you for instance help the network with the Marshall sign rule, change the choice of Hamiltonian, and so on. But I hope this can maybe lead to some discussion. Thank you very much for your attention, and please feel free to ask questions.

Thanks a lot, Agnes, for the really nice presentation. There were already a few questions, but we still have time for more.

[Audience] Thank you very much for this interesting talk. I have a question, because you raised it actually: what are the differences between the different ground states found? You linked this to the quantum geometric tensor in the end, and if you think of the quantum geometric tensor from a topological point of view, you can link it to topological properties of the state. Have you also studied topological models, and could you maybe link these topological properties to the quantum geometric tensor and then back to the NQS?

That sounds very interesting; thank you for the input. We have not done that, but there are topological models that people typically study with NQS that we could try. We have not looked at anything in that direction yet, but that's an interesting question you raise, thank you.

[Audience] Very nice talk, thank you. I was a bit curious how you increase the number of parameters, because you could fix the layers and increase the number of neurons, or you could make your neural network deeper, and these have different scalings. Yeah, so for the GCNN we tried a bit of both, increasing the size of the layers and increasing the depth of the network. In these plots we increase along the diagonal, both layer size and number of layers, which seemed to be the most useful direction, but we have not fully explored this yet, so there might indeed be a very different scaling. You can understand the RBM as an extreme case of that: you have one layer and you just increase the size of that layer, and from the plot it seems that the RBM in terms of scaling behaves quite similarly to the GCNN. So maybe; I'm actually not sure. But we did these two cases:
the diagonal and the single layer. Thanks.

[Audience] Hi, thank you for the nice talk. In all the plots that you showed, the RNN was significantly different from the other two architectures you were considering. Do you think the reason is the autoregressive property, or do you have any idea what's going on differently there? Yeah, that's a great question. The RNN is also very different in its parametrization, because you have the sequential correlation with the previous spins, so it's very different in how you parametrize the state. We talked a lot about this with Juan Carrasquilla's group, the RNN experts, and it's a very curious question. There are also other differences: what I didn't mention is that the RNN is not trained with stochastic reconfiguration, because it performs much worse if you train it that way, as you may know; it's trained using Adam, and it's not clear why this is the case. It may be that you just have many redundant parameters in your RNN and the parametrization is not efficient in that sense, so you might need to come up with a more physically motivated version. But maybe it is related to the autoregressive sampling; that's very open and, I think, a research project on its own. Thanks.

[Audience] A quick question; thank you for a very nice talk. Clearly you have used a pretty challenging model, which I understand was very much the point of the whole endeavor, but what I wonder is: have you perhaps investigated the ruggedness of the learning landscape with respect to the variance of the wave-function amplitudes? Say you have a very small variance; perhaps it's easier then, or proportional to that. Have you looked at that in any way? I see; no, we have not, but it's interesting to look at. The question, I guess, would be: maybe the variance of the wave-function amplitudes of your final state is very small or very high, but maybe you still need to pass through a high-variance or small-variance region of the optimization landscape. There is a paper by Chae-Yeun Park, I think, which finds that you do go through a point in the optimization landscape (I don't remember which Hamiltonians they look at; quite a lot of them) where you have a fat tail in your wave function, even though the final wave function might look different. So I agree, that's a very interesting thing to look at.

[Audience] As you pointed out correctly in the beginning, the main difference between this and supervised learning is that you don't have a training set; you generate it on the fly by Markov chain Monte Carlo. So I don't really understand, and it's an open question, how you define the over-parameterized regime in this context. Usually you have a fixed training-set size and you compare it with the number of weights and biases in the network, while here it's not clear; and also concerning the batch size, even in that case it's not really clear how to compare with the number of training data, right? Yeah, great question; I think that's exactly the point, and it's not clear. In classical machine learning you would take something like the total training-set size times the number of output classes, and this would give you the number of parameters after which you start being
over-parameterized. Here it seems like we don't really reach an over-parameterized regime even when we have more parameters than the Hilbert space size. You could say the Hilbert space size could be that threshold, that you just need to see all the samples, and that the gradient pointing a bit in the wrong direction would then still let the error go down; but that's not clear. And I'm actually not sure whether the relevant quantity might instead be set by the wave function itself, say a wave function without a fat tail, by the localization of the wave function: maybe that would define a relevant set of samples. I just opened this question; I didn't answer it.

[Audience] One quick comment related to this. You're right, but in some sense with NQS the situation is very much like what you have in reinforcement learning, where you also have a policy, the policy generates samples itself, and then you use these samples to update the policy. In that setup, typically when people talk about over-parameterization, what they believe is that there are multiple ways to reach, in this case, the ground state: there would be various configurations of the weights and biases that are distinct and that can all give you the ground state. In other words, you can have many ways to parametrize the same physical state with the architecture, and that would still be a form of over-parameterization. So in some sense you have the bottom of the landscape and you can reach that bottom along various paths, but it's not always the same bottom: it's the same physical state, but it can be different weights and biases.

Right. So let's thank the speaker again.