I think we can start; I still see people arriving, so I was waiting, but let's slowly begin. Hello everybody. This is the last lecture of this session of the Hitchhiker's Guide to Condensed Matter and Statistical Physics, and for us it's a pleasure to have here Juan Carrasquilla. Juan is now a faculty member at the Vector Institute and also an assistant professor at the University of Waterloo, if I read it correctly, working at the intersection of condensed matter, quantum computing, and machine learning. He's there in the middle. He's a former Diploma student of the ICTP, a success story for our Diploma Programme, and also a former PhD student here at SISSA. After that he moved on and did several postdocs in the US, and now he's at Waterloo. So please, Juan, whenever you want, you can start.

Okay, thank you, Alex, for the introduction. It's a great pleasure to be back at ICTP, even though this is an online talk; I'm always very happy to come. I want to tell you about recurrent neural networks and how we've been using them for many-body physics. The first part is going to be about basic notions, so let's get started. Here are the people funding my research, and my affiliations, the Vector Institute and the University of Waterloo.

So what are recurrent neural networks, why are they useful, and why do we think they're a powerful tool for studying many-body physics? Let me first tell you some generalities about what they are. They're a family of neural networks that are naturally suited for processing sequential data. That's how people usually talk about RNNs; you can in fact process basically any data with them, but in principle they're naturally suited for data that comes in a sequence, in an order. The sequence I'm going to use throughout my talk is a sequence of, say, spin variables σ_1, σ_2, …, σ_N; you can imagine them placed on a line, or you can think of the index i = 1, 2, …, N as time, for instance. The idea is that recurrent neural networks can typically scale much better to long sequences than would be practical for other neural network architectures you may have seen in the last couple of weeks, like the restricted Boltzmann machine, which I'm sure you saw last week. So there's this specialization for sequences with the RNN, and this is part of the conceptual idea that by exploiting the structure of the problem you can make progress computationally. We're going to use this idea a lot, and people in machine learning use it all the time: when you have a problem, you look at its structure and use that information to solve it. This is something we in physics do all the time too; we use symmetries and so on, which enriches our understanding of the problem but also saves us computational time, and machine learning people have been using the same idea. Also, most recurrent neural networks can process sequences of variable length, which is important: the vector σ here can have, say, either two spins, or N, or N + 1, and the same architecture processes all those variable-length sequences. That matters, for instance, in language translation and generally in natural language processing, which is one important class of tasks people in machine learning are interested in. So what do we do to go from a multi-layer feedforward neural network such as the
restricted Boltzmann machine, which you saw last week I believe, to an RNN? What you do is share parameters across different parts of the model. I'm going to explain what this means in a little bit. This is also what people do in computer vision, in another famous architecture called the convolutional neural network, which is right here as CNN. There's this idea, very popular both in physics and in machine learning, of sharing the parameters of a model as you process things over time or spatially; sharing the parameters makes the network more compact but also more efficient at processing certain kinds of data.

Compare with so-called feedforward neural networks such as the restricted Boltzmann machine: there, each input, say σ_0, σ_1, σ_2, σ_3, is processed with its own set of parameters. For instance, I've highlighted σ_0 and σ_3 here: as the information about these variables is processed through the architecture, in this case vertically, σ_0 is processed with its own set of parameters, which I denote in yellow, whereas σ_3 is processed with a different set, here in purple. That's how you do it in a feedforward neural network. In a recurrent neural network you do something a little different: you process σ_0 with a weight matrix W, for instance (I'll explain all of this in more detail later), then you process σ_1 and you reuse the same parameter W, and so on for σ_2 and σ_3. So each input is processed with the same parameters, say U and W. These blocks here are the so-called recurrent cells, and I'm going to give you an example of one. This parameter sharing is what makes RNNs powerful and compact, and it's also what enables the variable length; that's the conceptual leap from the more traditional feedforward neural network to the RNN with parameter sharing.

I'm going to use recurrent neural networks as models for probability distributions. That's not the only way you can use them, but it's what I'm going to focus on. So let's get started with the definitions and the math. Consider a probability distribution defined over a discrete sample space, where σ is a collection of variables σ_1, σ_2, …, σ_N, and each σ_n can take d_v values, from 0, 1, 2 up to d_v − 1. That could be a spin configuration, for instance spin-1/2 binary variables taking values 0 or 1 (or plus or minus one), but you can also handle higher-dimensional local degrees of freedom. The probability of observing σ is just p(σ) = p(σ_1, σ_2, …, σ_N). One important tool I'm going to use is the so-called chain rule of probability, which lets you write p(σ) in terms of its conditionals. This is general and exact:

p(σ) = p(σ_1) p(σ_2 | σ_1) ⋯ p(σ_N | σ_1, …, σ_{N−1}).

This is important; it's the basis of how we build the RNN. I'll introduce the shorthand p(σ_i | σ_{<i}) for the conditional p(σ_i | σ_1, …, σ_{i−1}).
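The chain-rule factorization is easy to verify numerically. Here is a minimal numpy sketch of my own, on a made-up two-spin joint distribution (not anything from the lecture):

```python
import numpy as np

# Toy joint distribution over two binary spins; values chosen arbitrarily.
p_joint = np.array([[0.1, 0.2],
                    [0.3, 0.4]])        # p_joint[s1, s2]

p1 = p_joint.sum(axis=1)                # marginal p(sigma_1)
p2_given_1 = p_joint / p1[:, None]      # conditional p(sigma_2 | sigma_1)

# Chain rule: p(sigma_1, sigma_2) = p(sigma_1) * p(sigma_2 | sigma_1), exactly.
reconstructed = p1[:, None] * p2_given_1
assert np.allclose(reconstructed, p_joint)
```

Each conditional row sums to one, and multiplying the marginal back in recovers the joint exactly; the RNN will parametrize precisely these conditional rows.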
One important observation: if you specify every conditional in the chain rule, this gives a full description of any distribution of this type. However, this characterization is very expensive in general. Either writing all these probabilities directly or writing all the conditionals is a representation that grows exponentially with the system size N: if you specify all the possible values of the conditional distributions, you get a big table with an exponential number of entries, so there's no hope of using it for practical purposes. The question is whether we can alleviate this explosion, and the idea is that yes, we will, by exploiting the structure of the problem, which is something we in physics do pretty much all the time. For instance, in tensor networks you have these exponentially complicated wave functions, and what you do is write an approximation in terms of, say, matrix product states, which alleviates that complexity. What we do here is similar: we try to alleviate the exponential complexity by exploiting some structure of the problem. The idea is that nature is benevolent, and real-world problems have enough structure that we can use far fewer resources to tackle them. What RNNs do is exactly that: they parametrize the probability distribution p(σ) entirely through the conditionals. Basically, we parametrize each of the conditionals in the chain rule. RNNs can do many other things, but this is the way I'm going to explain them.

The elementary building block of an RNN is called a recurrent cell, which helps us specify the conditionals. The simplest recurrent network, which people call the vanilla RNN because it is the simplest, is just a non-linear function that maps the concatenation of a d_h-dimensional hidden vector h_{n−1} with an input vector σ_{n−1}, which represents the spin at that site (up or down, or one or two), to a new d_h-dimensional vector h_n:

h_n = f(W [h_{n−1}; σ_{n−1}] + b).

That's all it is: a non-linear function applied to an affine transformation of the hidden (state) vector concatenated with the input, where f is a non-linear activation function, typically a sigmoid. What are the parameters of the RNN? A weight matrix W, usually real, though you can also use complex numbers, with dimensions d_h × (d_h + d_v), where d_h is the dimension of the hidden vector and d_v is the dimension of the input; and a bias vector b in R^{d_h}, the dimension of the hidden state. We're going to use this expression as a recurrence: you go from a given vector h_{n−1} to h_n. That's our recurrence relation, so we have to initialize h_0 and σ_0 to something to get the calculation going; we initialize them to zeros, but any constant vectors work. One thing I didn't specify, and I don't know if it was easy to notice: I have this σ in bold, which is just a vector representing each spin value, namely the so-called one-hot encodings. For instance, if the input is either 0 or 1, I replace it by a tiny vector of dimension two, where 0 corresponds to (1, 0) and 1 corresponds to (0, 1). That's for convenience.
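As a minimal sketch of this recurrence (my own toy illustration, with f taken to be tanh and random weights; any of the usual activations would do):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_v = 4, 2                          # hidden and input (one-hot) dimensions

# Shared parameters of the vanilla RNN cell: W is d_h x (d_h + d_v), b is d_h.
W = rng.normal(scale=0.1, size=(d_h, d_h + d_v))
b = np.zeros(d_h)

def one_hot(s):
    v = np.zeros(d_v)
    v[s] = 1.0
    return v

def cell(h_prev, s_prev):
    """Vanilla RNN recurrence: h_n = f(W [h_{n-1}; sigma_{n-1}] + b)."""
    return np.tanh(W @ np.concatenate([h_prev, one_hot(s_prev)]) + b)

# Initialize h_0 to zeros (any constant works) and unroll the cell over a
# sequence, reusing the SAME W and b at every step: the parameter sharing.
h = np.zeros(d_h)
for s in [0, 0, 1, 0, 1]:
    h = cell(h, s)
```

Note that the same `W` and `b` process every step, so the parameter count is independent of the sequence length; that is what makes the model compact and lets it handle variable-length inputs.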
So how do we proceed? We compute all the conditionals I mentioned in the chain rule through this expression: each conditional p(σ_n | σ_1, …, σ_{n−1}) is a dot product y_n · σ_n between an output vector y_n and the one-hot encoding σ_n, where the dot is just a scalar product and y_n is yet another layer in the RNN. I'll show how this looks graphically to make things a little more fun. It's a so-called softmax layer: you take the recurrent vector h_n, apply an affine transformation, and then apply the non-linearity called softmax. The softmax activation function is given by the exponential of each component of its argument divided by the sum of the exponentials over all components, so that the result is normalized.

There is a question related to the previous slide, so I think we can stop: can you elaborate on the condition on the dimension of the weight matrix? Why is it that specific? Let me see. So we have this weight matrix here, and what it does is basically contain the parameters of the model, and its dimensions are related to the power of the model. The expressive power of the model is encoded in the dimensionality of the hidden vector h, this d_h here, and that is tied to the size of the matrix. So the dimensionality of the weight matrix is the one here, d_h × (d_h + d_v), and it is tied to the expressive power of the model. I don't know if that answers the question; maybe we can allow Rajeev to be more specific. Hello? Hi, yes, you answered my question, thanks a lot; I got my answer. Okay, great.

Then there's a question by Shionee: I was wondering where or how the fact that the hidden layers are connected enters the activation function. Let me see. The hidden layers enter the activation function through this expression here: the hidden state of the model is h, and you apply the non-linearity to the concatenation of h and the input; that's how you process information in this architecture. So the activation function is connected to the hidden state through this recurrence relation. I hope that answers the question; thank you all for the questions.

So, I was at this extra layer of the recurrent neural network, the softmax, and the softmax ultimately provides the conditional probability: as long as U and c are real, the softmax layer gives you a normalized distribution that we interpret as a conditional. Once you have computed all the conditionals sequentially, you can compute the probability of the entire configuration σ by multiplying them together; the chain rule had all those conditionals multiplied, and that corresponds to this product here. There are a few important things about this model that I want to highlight, which I think make it very powerful. One of them is the fact that p is normalized by construction, as opposed to, for instance, energy-based models, or probabilistic models based on an Ising Hamiltonian, where you need to compute the partition function if you want to get p(σ), which is very challenging for some of these models. In this case there is no such issue: the model is normalized, which I think is a very powerful property.
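A sketch of that softmax output layer on top of the hidden vector (U and c are my names for the extra output weights, with random placeholder values):

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_v = 4, 2

U = rng.normal(scale=0.1, size=(d_v, d_h))  # output weights (placeholder values)
c = np.zeros(d_v)

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

h_n = rng.normal(size=d_h)    # stand-in for the recurrent hidden vector
y_n = softmax(U @ h_n + c)    # this vector IS the conditional p(. | sigma_<n)

# The conditional for a particular outcome is the dot product with its one-hot.
sigma_n = np.array([1.0, 0.0])          # one-hot encoding of outcome 0
p_cond = float(y_n @ sigma_n)
```

Because softmax divides by the sum of exponentials, `y_n` is positive and sums to one at every step, which is why the product of conditionals is normalized by construction with no partition function to compute.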
Second, sampling the probability distribution is achieved in a sequential fashion: you sample each conditional sequentially, and at the end of that process, which you do N times, you are guaranteed to have an exact sample of the model. There is no need to perform Markov chain Monte Carlo simulations to get samples from this model, which again is very powerful, I think. So I wanted to highlight those two points.

This is a graphical representation of the RNN. Here is the input; you send the input to the cell, which is the recurrence expression I was referring to; you compute the recurrence, and then you use the vector h_n to compute the conditional using the softmax, and at the end you get this nice parametrization of the conditional. This is a graphical representation of the fact that the cell is used over and over: you take h_{n−1}, send it through the recurrence, and you get h_n. That is the most compact representation, but there is also an unrolled version of this cartoon: you initialize at h_0 and σ_0, you do the little calculation in terms of the recurrence relation and the softmax, which gives you the conditionals; you get p(σ_1), then you use σ_1 to compute p(σ_2 | σ_1), and so on. So this is an unrolled version of the recurrent neural network, where each box is the recurrence expression with the corresponding hidden vectors h_{n−1} and h_n and its parameters. And this is how you compute the distribution: for instance, for the first term in this expression you take the output y_1 of the first step of the RNN and you multiply it by σ_1, and so on; you keep doing this until you get the value of the probability of the configuration.

Maybe I can answer the pending questions right now. The first question, by Joseph: I was wondering if the activation function relates to the partition function, and why. So, it does relate to the partition function, in the sense that when you compute the partition function, the sum over all possible values of σ, that gives you one by construction. The model is normalized by construction, so the partition function is not even there; if there were a partition function you could write it in terms of the activation function, I think, but the model is normalized, so there is no partition function per se. The second question: do you apply any cutoff on the conditional probability? This is a good question: do we apply a cutoff such that, say, p(σ_n | σ_{<n}) is conditioned only on a few of the variables? The answer is, in principle, no: we never apply such a Markov condition in the definition of the model. But in practice, as you process information, this vector h that is passed between the different steps of the unrolled RNN can only carry a certain amount of information; this is why it is called a memory vector. So in practice, if the dimension d_h of the hidden vector is not too big, the correlation between one variable, say σ_1, and some σ_{1+a} becomes weaker and weaker.
Actually, I think for most architectures it decays exponentially; it depends on how powerful the model is, but in practice there is some cutoff. It's just that this cutoff is not imposed by hand through the Markov conditions Mahmoud is talking about. That's a good question.

Maybe I can answer one more. The statement that the samples are uncorrelated is really powerful; how does this relate to numerical precision when calculating an observable, in comparison with Monte Carlo methods? This is a good question. The idea is that when you do Markov chain Monte Carlo, there is some autocorrelation time that you have to account for, so that the expectation values you compute using samples out of the Markov chain are correct. If you don't account for it and use correlated samples, you start introducing a bias in the expectation values of your quantity of interest. What I'm saying is that here this autocorrelation is basically zero, so you know that, given the architecture and the model you have, the expectation values are unbiased, whereas in Monte Carlo in general they can be biased, because guaranteeing that you're not using correlated samples is difficult. In practice what we do there is some sort of binning analysis to make sure the samples are really uncorrelated; what I'm saying is that you just don't have to worry about that type of analysis anymore, because the samples are uncorrelated. That's a good question.

There is also one question: under which conditions does this become unstable, as a non-linearity is being considered? I'm not sure I understand the question; can you speak, Sriharsha? Yes: here in this scenario you are using the sigmoid function, which is non-linear; so at which condition does this become unstable? If you are considering dynamical systems, there will be some notion of stability: at what point in time is this system stable, and at what point does it become unstable? What is the condition? So, I don't have a good answer to that question, but I understand it, and it is a sensible one: this can indeed be understood as a dynamical system. As far as I know, for the systems I've looked at, you can think of this as a map, and typically what happens is that you flow towards some fixed point. In my experience you always flow to one fixed point or another; I've never seen instabilities. What I see is some form of saturation in these models when understood as dynamical systems. This area still needs to be explored; I think there is some work along these lines on the dynamical properties of RNNs, but I'm not very familiar with it. In practice I've never seen problems: because everything is properly normalized, there's no explosion of anything here. Will this be periodic or non-periodic? It's not periodic, as far as I know. Okay, thank you. No problem. There's also a question: can σ be a matrix? I think it can; you can indeed reshape things, and anything can be a matrix in some sense, but yes, it can be a matrix. All right, let me go ahead, as I'm running out of time.

So that's how we compute probabilities; sampling is a little bit similar.
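The sampling loop can be sketched end to end like this (again my own toy implementation with random weights; `rng.choice` draws each spin from its conditional):

```python
import numpy as np

rng = np.random.default_rng(2)
d_h, d_v, N = 4, 2, 6
W = rng.normal(scale=0.5, size=(d_h, d_h + d_v))
b = np.zeros(d_h)
U = rng.normal(scale=0.5, size=(d_v, d_h))
c = np.zeros(d_v)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(s):
    v = np.zeros(d_v)
    v[s] = 1.0
    return v

# Autoregressive sampling: each spin is drawn from its conditional and then
# fed back in.  The result is one EXACT sample -- no Markov chain, no burn-in.
h, sigma_in = np.zeros(d_h), np.zeros(d_v)
sample, log_p = [], 0.0
for _ in range(N):
    h = np.tanh(W @ np.concatenate([h, sigma_in]) + b)
    y = softmax(U @ h + c)                # conditional p(sigma_n | sigma_<n)
    s = rng.choice(d_v, p=y)              # sample this spin
    log_p += np.log(y[s])                 # accumulate log p(sigma) on the fly
    sample.append(s)
    sigma_in = one_hot(s)
```

One pass through the loop yields both an exact configuration `sample` and its log-probability `log_p`, with no autocorrelation between repeated calls.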
What you do is initialize h_0 and σ_0, compute the first conditional p(σ_1), then sample it; you can use a random number generator, and this is just sampling up or down given the probabilities. You get σ_1, the first sampled spin; you bring it back, input σ_1 here, and compute p(σ_2 | σ_1) for the σ_1 you observed. You sample again (again two numbers, the probabilities of up and down if this is a two-dimensional local space), get σ_2, store it, bring it here, sample p(σ_3 | σ_2, σ_1), get σ_3, and you keep repeating this over and over; that's how you get a sample. It's very easy, and it gives you exact samples, as I was saying.

So that's it for probabilities; let me quickly go through how you can extend this to quantum many-body systems, which is the next part: RNN wave functions. Let me first tell you about an important class of so-called stochastic Hamiltonians, or stochastic many-body Hamiltonians. They have ground states |Ψ⟩ with strictly real and positive amplitudes in the standard computational basis |σ⟩; so now I'm promoting σ to a quantum basis set, these kets. You define the ground state of a stochastic Hamiltonian in this form, and you can interpret it as the square root of a probability distribution,

|Ψ⟩ = Σ_σ √p(σ) |σ⟩,

when you restrict to ground states of Hamiltonians that are stochastic. So it is natural to use the recurrent neural network to represent this wave function: what you are doing is a coherent superposition built from the RNN, and that's our RNN wave function, which we explored in a paper with my student Mohamed Hibat-Allah. However, wave functions are complex in general, so we need a phase, and there's a simple way to introduce one; let me tell you how we did it. This is the usual RNN that I explained a little earlier, where we just make a coherent superposition of the square roots of the probabilities, so you can define a wave function through the usual RNN. But if you want a complex-valued wave function, you can add an extra layer on top of the softmax. Here we have the softmax layer, which is what we had originally; now we also have a so-called softsign layer that computes a phase for every conditional. It just adds a few more parameters to the RNN, and you use those parameters to estimate phases φ_1, φ_2, …, φ_N. Let me give you the details. The softsign layer uses the softsign function, softsign(x) = x / (1 + |x|), which lies between −1 and 1; if you multiply it by π, it looks like a phase. The overall phase of the wave function is then just the sum of the phases at each site, or each spin.

So now, in the last few minutes I have: that's all about the architectures and the basics of the RNN; let me tell you how to train these models, and I'll hopefully give you two or three examples. The traditional way people do it in machine learning is that you take a big bunch of data from a probability distribution you observe in nature, such as images on the internet or a collection of words in a book.
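Before turning to training: the softsign phase layer just described can be sketched as an extra head on the hidden vector (the names `U_phi`, `c_phi` and the random weights are my own placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
d_h, d_v = 4, 2

U_phi = rng.normal(scale=0.1, size=(d_v, d_h))  # extra phase-head weights
c_phi = np.zeros(d_v)

def softsign(x):
    return x / (1.0 + np.abs(x))     # maps R into (-1, 1)

h_n = rng.normal(size=d_h)           # stand-in for the recurrent hidden vector
# Per-step phases: pi * softsign(.) lies in (-pi, pi), so it acts as a phase.
phi_n = np.pi * softsign(U_phi @ h_n + c_phi)

# The overall phase of the wave function is the SUM of the per-site phases,
# so the amplitude is sqrt(p(sigma)) * exp(i * sum_n phi_n).
```

This adds only a few parameters on top of the softmax layer while leaving the modulus, and hence the normalization, untouched.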
Then you can use the so-called maximum likelihood estimation, or maximum likelihood principle, to estimate the parameters of the RNN. This principle is very simple, but it's also a very deep idea: the parameters of the statistical model, which here is our recurrent neural network, are selected by assigning high probability to the data you observe. This makes sense because you imagine you are sampling some probability distribution, for instance the distribution of natural images, the images you take with a camera; what you get out of the camera has high probability of occurring, because it happened in nature. That's the idea: when you sample experimentally from a probability distribution, the observations you see have high probability, so if you feed that data to a probabilistic model, the model should assign them high probability. That's the principle of maximum likelihood estimation. You are given a data set, a collection of data points σ^(n), and you compute the probability of that data under the model; that's called the likelihood. You assume the samples in the data set are uncorrelated, so the probability of observing the data set is the product of the probabilities assigned by the model p_θ, where p_θ is the RNN and the parameters of the RNN are encapsulated in the variable θ. You can maximize this quantity with respect to θ; however, since probabilities are between zero and one, if you multiply many of them, this likelihood function eventually becomes very small. So instead of using the likelihood function, for numerical convenience you define the negative logarithm of the likelihood: you take minus the log and end up with a sum of log-probabilities, which is much better to handle numerically, and instead of maximizing the likelihood, because of the minus sign, you minimize this negative log-likelihood. You do this using gradient descent techniques, which I think you saw last week.

So what are the ingredients of this optimization, this minimization using gradients? First you have to compute the gradient with respect to θ of the negative log-likelihood, and to do that you use an algorithm called backpropagation through time, which is basically applying the chain rule to the function defined by the RNN. I have slides on this at the end, but it's very clear to me that I won't have time, so let's leave it at that; I encourage you to take a look at this link, which has a derivation of backpropagation through time, that is, of the gradients of the RNN with respect to its parameters, by a colleague of mine, Roger Grosse; it's very clear. Then you do a parameter update following the direction of steepest descent: you replace θ by θ − α times the gradient, for some small parameter α, and you iterate until convergence. The hard part is the optimization with these derivatives, because they can explode or vanish; this is related to the earlier question about stability. There's no problem in the calculation of the probabilities, but the gradients can become very large or very small, which is a consequence of understanding the system as a dynamical one, so for the behavior of the gradients there are some issues, and you need to design the architecture so that this is improved.
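The maximum-likelihood loop is easiest to see on a model with a single parameter (a deliberately tiny stand-in for the RNN, so the gradient can be written by hand instead of via backpropagation through time):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy "dataset": 80 up-spins and 20 down-spins.
data = np.array([1] * 80 + [0] * 20)

# One-parameter model p_theta(s=1) = sigmoid(theta); negative log-likelihood.
def nll(theta):
    p = sigmoid(theta)
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

theta, alpha = 0.0, 0.01
for _ in range(500):
    grad = np.sum(sigmoid(theta) - data)   # analytic d(NLL)/d(theta)
    theta -= alpha * grad                  # steepest-descent update

# Maximum likelihood recovers the empirical frequency p(s=1) = 0.8.
assert abs(sigmoid(theta) - 0.8) < 1e-3
```

For the real RNN the structure is identical; only the gradient computation is replaced by backpropagation through time over the unrolled network.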
method of training in physics we were often given Hamiltonian right like we're not given data but we also have we often have the so-called Hamiltonian and so we can use variational Monte Carlo to optimize these answers with respect to some local Hamiltonian maybe before you continue because there are questions related with all the learning stuff with maximum likelihood yeah let's so maybe before I go on with the question so how much time do I have do I have it's already a break now right or in a few minutes no not necessarily I mean okay I think you have still three okay great okay so then then I'm gonna have enough time okay awesome so let me go ahead with the question so so as Lavie Kumar is asking as you mentioned RNN is concerned about sequential data can we determine the time dynamics of the system for instance wave functions sequence in time so I think I'm not sure exactly what the task is here but yeah so you can determine some time dynamics in some systems and this has been explored there's a exploration along those lines of like using RNNs to explore real like some time dynamics right so let me go ahead with a different question so Sandeep is asking during sampling is it necessary to take account of fluctuations in sampling because sigma one is fitted into next step sigma two and as you have said there is random probability sampling so yeah so you so when you're doing sampling yes you account for the like the probabilistic nature of what you're trying to do which is sampling so actually so you output this say for instance p of sigma one in the first step and then you do a sampling step so you roll a coin or you roll a die and then depending on the outcome you can get either up or down and then you take that and you bring it back to the RNN to compute the condition the p of sigma two condition on that outcome okay and so if you repeat this multiple times you would see sometimes you see up sometimes you see down and so you would have accounted for this some 
fluctuations that Sandeep is mentioning. Is there some stochasticity in gradient descent? In principle, depending on your cost function and on the training style, there will be. If your data set is very large, then instead of taking the entire data set to compute the log-likelihood you take only a fraction of it and use that to estimate the gradient, and as you cycle through different pieces of data there will be fluctuations due to the fact that you are not using the entire data set. If you use the entire data set, there is no stochasticity in maximum likelihood estimation. But if you are doing variational Monte Carlo, there is always going to be stochasticity, because you cannot sum over the entire Hilbert space, so the gradient estimates are stochastic. That's a good question; thanks, Valentin. Will it work for any kind of Hamiltonian, or only local Hamiltonians? For variational Monte Carlo this is meant only for local Hamiltonians; if you have a non-local Hamiltonian, the calculation of the gradients and of the expectation value of the energy becomes intractable. And the second question: how does the dimension of h relate to the physics? The dimension of the RNN hidden state is a so-called hyperparameter, a parameter that you fix by hand, and what people do in practice is make it as high as they can afford, because that makes the model more and more powerful. The physical reasoning is that h is the mechanism through which correlations are expressed in the probability distribution or in the wave function. If you have a physical system with strong correlations, you want this dimension to be high, whereas if you have no correlations, for instance in a mean-field theory, this dimension can be
zero, and the wave function becomes a mean-field approximation. That is how I reason about it: if you make the hidden dimension zero, the probability distribution is a product distribution, or in the wave-function case a mean-field theory, and as you make this dimension higher and higher you become capable of accounting for stronger and stronger correlations. Mahmoud is asking: sorry, I did not understand what approximation makes the RNN behave linearly with data size. I don't think I made the statement that they behave linearly with data size, but if you want to clarify the question you are welcome to, Mahmoud. Hi; in the previous lectures, the size of the Hilbert space was reduced by symmetry constraints or locality constraints, which decreases the amount of data we need to run our calculations. What type of approximation can be used here? My previous question about a cutoff applied to the conditional probabilities was in this respect. Yes, so we don't apply cutoffs. The approximation we make is that this recurrent neural network representation cannot represent arbitrary conditionals unless you make the dimension of the hidden vector exponentially large, which we want to avoid. If you made the dimension of the hidden vector very large, you could in principle represent any probability distribution p of sigma. So the approximation is that, by restricting the amount of correlation, that is, the size of the hidden state, you impose restrictions on the type of conditionals you can represent. That is what makes the RNN cheap, but also less powerful than a generic approach such as storing full probability tables, which is exponentially big. There is a question by Tim: for a sequence sigma one through sigma n, do we assume the
Markov property? I answered that question earlier, I think: there is no Markov approximation. There is an effective Markov-like restriction, because fixing a finite dimension for the hidden state limits the distance over which you can express correlations, but in principle you do not make any Markov approximation. I think this is a very important question: something like a Markov truncation happens effectively, but it happens naturally, because a finite hidden state cannot account for arbitrarily long-distance correlations. There is a question by Lavi: just to understand better, the RNN approach you showed is for sequential data of the system, for instance calculating the free energy as the temperature decreases and then determining the ground state; is that right, and is that another way to determine the ground state? It depends: are you talking about the ground state of a classical Hamiltonian or of a quantum Hamiltonian? In both cases you can do this. You can obtain ground states, the zero-temperature states of quantum many-body systems, but you can also use these approaches to compute the ground state of a classical Hamiltonian, which is an example I have prepared for the second part of the lecture. Okay, so I won't go through too many details of RNNs with variational Monte Carlo; I'm going to assume you have seen this, and I think you did last week. You can use variational Monte Carlo to estimate ground states of many-body Hamiltonians, and you do that by simply computing the energy and its gradients. The gradients have a standard form; the details are not too important, since we are going to use the same technique, gradient descent on the expectation value of the energy of a local Hamiltonian, with the simplest parameter update. There are more sophisticated update rules, like stochastic reconfiguration,
that I think Lepovic and Pini explained last week, and so on. But let me conclude with a beautiful example, related to a question from before about whether you can get ground states of classical Hamiltonians and use this for statistical mechanics. There is a way to do it, described in a very beautiful paper by Wu, Wang, and Zhang published two years ago: you can apply a variational approach to statistical mechanics. The idea is that you have a model distribution p_theta that approximates the Boltzmann distribution at a finite temperature T. You may wonder why you would want to do this; bear with me. What you do is try to make the probability distribution p_theta match the Boltzmann distribution, and you can do this by minimizing the free energy of p_theta with respect to the parameters theta. What is this free energy? It is the expectation value of the classical Hamiltonian over the distribution p_theta, minus T times the entropy of the distribution, F = <H> - T S, where H is the classical Hamiltonian and the entropy, whose expression I'm afraid I didn't write, is basically S = - sum over spin configurations sigma of p(sigma) log p(sigma). It turns out that if you use a recurrent neural network, or any autoregressive model, or any normalized model such as the RNN, both the energy and the entropy are very easy to compute, so you can use the same strategy to approximate statistical mechanics problems. Here is how you do it: the free energy can be estimated from samples, and I explained how to obtain these samples from the RNN, so basically you take N_s samples and compute the average of the so-called local free energy, F_loc(sigma) = H(sigma) + T log p_theta(sigma). Here I have the word 'target' because of something I'm going to talk about later; H(sigma) is the classical
energy of the classical configuration sigma, plus T times the log of the probability p_theta(sigma). As I said, I explained how to compute p_theta: it is just the sequential process where the RNN produces the probability of a configuration, so this is actually easy to compute, and then both the free energy and the gradients are easy to compute. It turns out that the gradients of the free energy can be computed using samples from the RNN: you take N_s samples and average the gradient with respect to theta of log p_theta on the samples you drew, times the local free energy. The thing I haven't explained yet is how you get these gradients of log p_theta; that is again backpropagation through time, for which I refer you to the lecture notes. In practice we use automatic differentiation, a very powerful technique that allows you to compute the gradient of any differentiable program you write, so it is very easy to implement; you essentially don't have to do anything. Then you use a simple parameter update, theta equals theta minus a small constant times the gradient of the free energy, and you iterate this to convergence. That is how you solve, or approximate solutions to, statistical mechanics problems. I think that's it for now; let's see if there are more questions. There are no more questions, so I guess it is time for a break now. Alejandro, Alejandro, Alex?
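As a brief aside (my own toy example, not from the slides): the variational recipe just described can be checked explicitly on a system small enough to enumerate. For any normalized model distribution, F_theta = <H>_p + T <log p>_p upper-bounds the exact free energy F = -T log Z. Here the "model" is the simplest possible normalized distribution, a product of independent Bernoullis, standing in for the RNN, on a hypothetical 4-spin Ising chain.

```python
import itertools
import numpy as np

J, T, N = 1.0, 2.0, 4  # coupling, temperature, number of spins (toy values)

def energy(s):
    # Open-chain ferromagnetic Ising energy, spins s_i in {-1, +1}.
    return -J * sum(s[i] * s[i + 1] for i in range(N - 1))

configs = [np.array(c) for c in itertools.product([-1, 1], repeat=N)]

# Exact free energy F = -T log Z by full enumeration (feasible only for tiny N).
Z = sum(np.exp(-energy(s) / T) for s in configs)
F_exact = -T * np.log(Z)

def F_var(q):
    # Variational free energy <H>_p + T <log p>_p of a product (Bernoulli) model
    # with p(s_i = +1) = q_i, evaluated here by enumeration rather than sampling.
    F = 0.0
    for s in configs:
        p = np.prod([q[i] if s[i] == 1 else 1.0 - q[i] for i in range(N)])
        F += p * (energy(s) + T * np.log(p))
    return F

# The bound F_var >= F_exact holds for any parameters, e.g. the uniform model.
F_model = F_var(np.full(N, 0.5))
```

A real calculation would replace the product model by the RNN and estimate both terms from N_s samples of F_loc(sigma) = H(sigma) + T log p_theta(sigma) instead of enumerating, but the bound and the two terms are exactly the ones in the free-energy expression above.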
Well, we can do our break now, I think, for some minutes as you need. I can go on, it's up to you. Well, I think let's do a small break, right, Tasia? Yes, I think we can take a break; we have 50 minutes by the program, so we can have a coffee. So I'll see you again at 3:15. I'll leave everything open, I guess, and if you have questions you can type them in the Q&A; I'll be back in five minutes if there are any. Okay, perfect, thank you, Juan. You have a question in the Q&A. Yes, it's a very good question, so maybe I can answer it right now, is that okay? Yes, why not. The question: 'Hi, I'm checking the GitHub source for the PRL 2019 paper you cited at the end for the Ising model, and it says running these scripts may take thousands of GPU hours and produce hundreds of gigabytes of output data. Why, then, is it convenient to do this with respect to efficient Monte Carlo sampling techniques? Maybe only for example purposes?' This is the key question, and I really like it. The key idea is in the question already: 'with respect to efficient Monte Carlo sampling techniques.' It turns out that efficient Monte Carlo sampling is not easy. It is easy for the Ising model, but if you want to tackle, say, a spin glass problem, that efficiency just goes away, because the Monte Carlo simply gets stuck. So the question is: if you do this variational technique, and the optimization, the approximation to the free energy, is good, and as they demonstrate in that paper it can be very accurate, then these autoregressive models can be sampled exactly, with zero autocorrelation. At that point the question is: are you closer to the ground truth by doing approximate sampling with a very slow
Markov chain, or with this free-energy approximation, this free-energy bound, which is a model that you can sample exactly? The answer is that for many of these problems, the approximation you get out of this variational bound is better than the samples you get out of Monte Carlo. So I agree that for the Ising model, for which we know very efficient algorithms, this is just an example, but for problems that are very difficult to sample, like spin glasses and other pathological models, this may be a tool that allows you to make progress. Actually, I am going to show you an example where we solve ground states of spin glass problems using this idea, and it works better, so stay tuned for my last slide. I think we can continue; it's already a quarter past. Sure, let's go ahead; let me share my screen. Can you see my screen? Yes. Okay, so now: recurrent neural networks for quantum many-body physics. I'm going to give you two examples of research I have done using RNNs, one on quantum state reconstruction and one on variational annealing, which is the idea I just explained, closely related to the question I just answered. Simulated annealing is a technique to solve combinatorial optimization: what you do is simply run Markov chain Monte Carlo on the problem Hamiltonian, slowly decrease the temperature, and hope that at the end you find the ground state of the classical Hamiltonian, which is the solution to the problem. But for challenging problems the dynamics of the Markov chain is very slow, so there is a chance that we can make progress with these models, and I am going to give you an example of that. Let me start with quantum state reconstruction, which is a little more involved.
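To make the comparison concrete before moving on, here is a bare-bones sketch of the simulated-annealing baseline just described (an illustrative toy, not the lecture's actual code): Metropolis single-spin flips on a small ferromagnetic Ising chain while the temperature is slowly lowered, tracking the best configuration seen.

```python
import numpy as np

rng = np.random.default_rng(2)
J, N = 1.0, 8
s = rng.choice([-1, 1], size=N)                 # random initial configuration

def energy(s):
    # Open-chain ferromagnetic Ising energy.
    return -J * np.sum(s[:-1] * s[1:])

E = energy(s)
E_init, best_E = E, E
for T in np.geomspace(3.0, 0.05, 2000):         # slowly decreasing temperature
    i = rng.integers(N)
    s[i] *= -1                                  # propose a single spin flip
    dE = energy(s) - E
    if dE <= 0 or rng.random() < np.exp(-dE / T):
        E += dE                                 # accept the move
        best_E = min(best_E, E)
    else:
        s[i] *= -1                              # reject: undo the flip

E_ground = -J * (N - 1)                         # exact ground-state energy
```

For this easy ferromagnet the anneal typically reaches the ground state quickly; the point made above is that for spin glasses the same Markov chain gets stuck, which is exactly where the exactly-sampled variational models become attractive.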
Before that, and this slide is in the wrong place, I wanted to mention that last week we posted this paper, a hands-on tutorial on neural networks in quantum many-body physics. We describe many of these techniques in that paper, including the recurrent neural network, and the nice thing is that we provide code you can play with. It is very simple code, but it allows you to get started with all these ideas, so I encourage you to check it out; it has all sorts of examples, variational Monte Carlo, quantum state reconstruction, and so on. If you are interested and want to learn how to code these things, this is hopefully a good starting point. Anyway, let me go ahead with quantum state reconstruction. What is learning a quantum state, and what is quantum state tomography? It is the following: quantum state tomography is the process of reconstructing the quantum state of a device from measurements. You take a quantum system, you measure it many, many times, and then you try to infer what the quantum state is. It is the gold standard for verification and benchmarking of quantum devices. It is useful to characterize, for instance, optical signals; it is useful to diagnose and detect errors in quantum state preparation, for instance states produced by a quantum computer; it can be used to detect entanglement; and many more things. The idea is that we need to go beyond standard quantum state tomography reconstructions, because there has been recent progress in controlling very large quantum devices and systems, and arbitrary measurements are now performed with relatively high accuracy, so the bottleneck becomes the estimation of these quantum states and the associated curse of dimensionality: when we try to represent quantum states in a classical computer, they entail exponential resources in
time and memory. So we need to go beyond standard quantum state tomography, which uses exponential resources, and this is important for the future of quantum simulation and the benchmarking of quantum computers and quantum simulators. Here I have trapped ions on top, Google's processor, D-Wave's machine, which is a quantum annealer, and here cold atoms; these are just examples of devices you may want to characterize through quantum state tomography. This slide on the pace of growth is already outdated, but: a 51-atom quantum simulator, 53 qubits, 1800 here; I am very proud of this one, I participated in it; quantum chemistry simulators, and so on. There are lots of exciting quantum simulations, mostly, that are becoming available and growing, so can we devise tools to benchmark these state preparations? What are the ingredients? A quantum system that you can prepare repeatedly, because, as you may have heard, when you produce a quantum state and measure it, you destroy it, so you have to be able to repeat the same experiment many, many times. Then you need some set of measurements that you can apply. Then you need a model and a training procedure: you have a model for the quantum state, either a full representation of the density matrix, or a matrix product state, or a matrix product operator, or even a neural network, and a training procedure that, if you want, fits the measurements to this model. At the end you need a certification, which is like putting a stamp on the model you trained: for instance, computing the fidelity of the reconstruction with respect to the ideal state you are trying to prepare in your quantum computer. So a typical tomography protocol
prepares many copies of the state, measures them in multiple ways, and finally fits the outcomes of those measurements to produce an estimate of the quantum state. That is the fitting part, and that is roughly how it goes. Here is one example of how you do this in practice, called maximum likelihood estimation because it is in the same spirit as what I was discussing: it requires computing probabilities, and you use the likelihood function I discussed to fit a model. What is this model? It is a physical density matrix in its most general form, meaning that the representation scales exponentially with the size of the system. What you do is assume that the measurements are independent, which they are, compute the probability of observing these outcomes in the experiment, and maximize this probability; this is the so-called maximum likelihood principle that we discussed, and you thereby fit the density matrix to the data. It has an issue, which is the exponential scaling both in the parametrization and in the processing time of the algorithm. This is the most reliable tool in the sense that it is the most general, but it scales poorly, so you cannot apply it beyond a handful of qubits, say 10 or 12 at most as far as I know, and so you cannot apply it to these large quantum simulators. The question, then, is how to make quantum state tomography efficient, and there are multiple ideas out there. One of the most interesting is to introduce a parametrization of the quantum state with good scaling and nontrivial structure: for instance, you can use a matrix product state or a matrix product operator and do the reconstruction using that as your model for the quantum state. These two are among the most powerful techniques
based on matrix product states and matrix product operators. However, there are other approaches, which we came up with, that follow the same trend of introducing a parametrization of the quantum state with good scaling, by using, say, restricted Boltzmann machines, or neural networks in general. In this paper, which we wrote a few years ago, we used a restricted Boltzmann machine and performed quantum state tomography. This was extended later by Giacomo Torlai and Roger Melko, who introduced, if you want, a density operator written in terms of a neural network. The reference-three approach works for pure states with structure and has good scaling in terms of the resources you need, while the latent-space purification approach handles mixed states but has unfavorable scaling, so it does not solve all the problems. Today I want to tell you about an approach where, instead of parametrizing the quantum state directly, we parametrize the measurement statistics of a measurement, which are given basically by the Born rule, and to parametrize those measurement statistics, the probability distribution of the measurements, we use the RNN, the model that we discussed earlier today. We then use this idea to learn synthetic states, basically numerically generated experiments mimicking experimental data. That is the idea of this whole approach. Before telling you how to do this, let me say that it works for pure and mixed states with structure, meaning that it works well for quantum states that are well represented by a recurrent neural network, and it has good scaling in terms of the resources, as long as the quantum state has that structure. Now, which states can we represent with an RNN? That is still a bit of an open question, but since RNNs are universal
function approximators, there is hope that as you make the RNN more and more powerful, you become able to capture more and more of the quantum states that are interesting, and the evidence we have is that the RNN can represent reasonable, nontrivial quantum states; I am going to show you examples of that. Let me remind you of the setting: we have a large quantum device and we want to know whether it is working as intended. We think this device can produce some nontrivial quantum state, and we want to certify that the system works in some simple cases: we ask the device to produce some quantum state, say a matrix product state, and we want to benchmark this preparation. This is useful in the near term, because as quantum computers become stronger and stronger they will produce quantum states that we cannot represent classically; that is the hope of quantum computing. So there is no expectation that this will work forever, but we can use it to benchmark near-term quantum computers and quantum devices. Let me discuss a little of the theory behind what we are doing: quantum states, measurements, and probability distributions. A quantum state is traditionally described by a density matrix, which describes the statistical state of a quantum system; in quantum mechanics, everything we can possibly know about a quantum state is encoded in the density matrix. What is this density matrix? It is a positive semi-definite Hermitian matrix of trace one acting on the Hilbert space. This family forms a convex set, meaning that all possible quantum states form a convex set, and for one qubit it is the Bloch sphere, the sphere we have here. In high dimensions the shape of this convex set is not known, but it is known to be convex and basically similar to a sphere, a deformed sphere. That is the
traditional approach, but can we represent quantum states with just probabilities? That is what I want to do, because I am going to use the RNN in its simplest form to represent them. The idea is that you can do this through measurements; that is why this is so natural for tomography. Measurements are described by positive semi-definite operators M, which we call positive operator-valued measures, POVMs. These POVMs are, if you want, mathematical representations of what you do when you measure a quantum device: collections of positive semi-definite matrices indexed by a measurement outcome a. For instance, if I measure the spin along the z direction, then a tells me whether the outcome is up or down, and these operators sum up to the identity. So you go to the lab, you prepare the experiment so that you measure M, and your device gives you an outcome a, say up or down. What is the relation between experiments and measurements? It is this expression here, called the Born rule, because Born basically came up with it: you have a quantum state rho prepared in the lab, you have a measurement apparatus, you measure, and the probability that you observe a in the experiment is given by the trace of rho times the measurement operator, the POVM element M_a, so p(a) = Tr(rho M_a). That is the probability that you observe, say, up or down in the lab. It is, if you want, the fundamental link between quantum theory, that is, the description of the quantum state rho and the description of the measurement, and what you see in the experiment, which is some
probabilistic outcome of the measurement. Now I am going to use so-called informationally complete POVMs. What are informationally complete measurements? They are a measurement, or a set of measurements, such that if you measure the quantum state with that apparatus, you get all the information about the quantum state. You can think of it as follows: suppose your quantum state is a three-dimensional object and you want to understand that object. To do that, you have to look at this three-dimensional shape from different directions in order to characterize its shape entirely. Informationally complete measurements are exactly that: observing the quantum state from all possible directions such that you can determine what it is. I like this analogy: it is like observing an object from multiple directions. For instance, a measurement that is not informationally complete for a 3D object would be observing only along the x direction, which does not tell you what happens behind or above. So informationally complete measurements are a set of measurements that allow you to see the quantum state in all possible directions, such that you can draw a full picture of the quantum state; that is how I think about it, or at least a simple way to think about it. Informationally complete means that the measurement statistics p(a), which are, if you want, the observations, contain all the information about the quantum state. It also means the following: given the Hilbert space, you can span the space of operators with this set, meaning that you can write any operator rho as a linear combination of the operators M_a. The final
meaning I want to highlight is that the relation between rho and the probability distribution p can be inverted. Here we go from rho to p; if you have informationally complete measurements, you can typically invert this, so you can write rho in terms of the probability distribution p, so that p becomes, if you want, the quantum state. That is what we exploit here: we select informationally complete measurements and represent the quantum state in terms of this probability distribution p(a). This is the inversion part; I do not want to go through the math, but the message is the following: the Born rule tells us that we can go from the quantum state and the measurement to the probability distribution, and if we have an informationally complete set of measurements we can go the other way around and write rho in terms of the probability distribution p. That is the key element; the expression is this one here, and what I have in this representation is a tensor-network picture of the expression, which tells me that what we are doing is factorizing rho into a probability distribution, which is complicated, and a set of simple product tensors, which are very simple and factorized. All the complexity and potential intractability of the quantum state gets pushed into the probability distribution, because that is what prevents this distribution from being, say, a mean field; all the interactions and all the entanglement get pushed into the probability distribution, and that is what we exploit. The insight we had was that we can create a representation of the quantum state in terms of a probability distribution over an informationally complete set of measurements M, and we use recurrent neural networks to represent this distribution p, because we know recurrent neural networks are very powerful: they allow for exact
sampling, they have a tractable density, so we can compute the probability of a configuration, and we can use maximum likelihood estimation, for instance, to learn the RNN if we have a data set of experimental outcomes. That is the idea. So what we have is a model for the density matrix rho: an RNN, which here I call a language model since these are the models typically used in language processing, with a set of simple, tiny tensors attached. That is our model for the density matrix, for the quantum state. Let me recap what we would do in the lab, and we have actually done it in a recent paper. We prepare a quantum state repeatedly on a quantum device, here Google's processor or IBM's quantum computer. We perform this informationally complete measurement, which gives us a large collection of measurement outcomes, a big data set. Then we fit the recurrent neural network using maximum likelihood estimation, exactly as I explained in the first part of today's discussion. Then we 'invert' the density matrix, and I use quotation marks because this is only formal: inverting exactly is exponentially difficult, so we do not do it in full, but we can do it in practice when we want, for instance, to compute correlation functions over the quantum state, or classical fidelities, and so on. Finally we perform some sort of certification, either through fidelity, for instance classical fidelity, or by measuring correlation functions that are relevant for the system you are interested in. So that is what we did. Let me give you one example of reconstructing numerically generated quantum states. Here is a pure GHZ state, the so-called cat state: a superposition of all spins 00000 plus all spins 11111. The density
matrix is this one here. We also introduce a model of noise, because this is a pure state but we also want to explore whether we can represent mixed states. The noise model is the following: with probability p we apply an error to the quantum state, where with probability one third each we apply sigma x, sigma y, or sigma z, and with probability 1 minus p we do nothing. So if the probability p of introducing an error is zero, we are back to the pure state, but if we make p large we get a completely mixed state. That is the idea. So: can we reconstruct these quantum states using a recurrent neural network? Here is the learning: this is two qubits, we apply measurements for different values of the noise, and then we train the model. Here I am using a restricted Boltzmann machine as a first example, and this is the KL divergence, which is basically the difference, or the distance (not exactly a distance, but a divergence), between the model distribution and the exact probability distribution of the measurement outcomes, as a function of training as we optimize these models with maximum likelihood estimation. What we see is that as we train, this divergence goes to zero for the different values of the noise, from p equals zero all the way to p equals one, so we successfully train the model. What we also see is that training is harder for low values of the noise but easy for large values, which makes sense, because at high noise the target is basically a constant, completely random, completely flat distribution, so it is very easy to learn; there is nothing to learn, it is just one parameter, so the divergence goes to zero faster, whereas for low values of the noise it takes some effort. So that is the KL; this
is for the fidelity, which is basically the overlap between the two distributions and goes to one as you train. And finally we have the quantum fidelity, which measures the distance between the quantum states themselves, and it follows the same trend: it goes to one as you train the models, meaning everything is working. However, this was only for two qubits, and it was difficult to scale beyond three or four qubits. So we said, how about we try something else, and we tried the RNN. With the RNN we were able to go up to 80 or 90, I think 100 qubits, which was pretty surprising; we were very happy, and we thought this was a very strong result: the RNN is capable of representing these quantum states in this probabilistic form. What we have here is the classical fidelity as a function of the number of experimental outcomes used in the reconstruction. What we see is that for small values of noise, as you add more and more data you get a better and better reconstruction, that is, higher and higher fidelity. We also see that for high values of noise you need less and less data, for the same reason as before: learning a flat distribution is cheap and easy, so you need fewer data points. The interesting thing is that to get a high classical fidelity, say 0.95, you only need a number of samples N* that scales roughly linearly with the size of the system, the number of qubits. So in that sense this approach has scalability properties that are mild, in the sense that if you use this so-called classical fidelity, you can achieve high reconstruction accuracy with moderate resources. And this is in part due to the fact that we are using this RNN, which is very powerful, and we can represent many important distributions with it.
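To make the two ingredients mentioned so far concrete, here is a minimal sketch of an autoregressive model with tractable likelihood and exact (ancestral) sampling, plus the classical fidelity between its exact and empirical distributions. This is not the talk's RNN: a hand-made rule stands in for the network's conditionals, and the fidelity convention is assumed (some authors square it).

```python
import itertools
import numpy as np

# Toy stand-in for the RNN: an autoregressive model over a few spins where
# each conditional P(sigma_k = 1 | previous spins) is a simple hand-made rule.
N_SPINS = 4
rng = np.random.default_rng(0)

def conditional(prev):
    """P(next spin = 1 | spins sampled so far)."""
    if not prev:
        return 0.5
    return 0.9 if prev[-1] == 1 else 0.1   # favor repeating the last spin

def sample():
    """Ancestral sampling: draw sigma_1, then sigma_2 | sigma_1, and so on.
    Each configuration is an exact sample, with no autocorrelation."""
    sigma = []
    for _ in range(N_SPINS):
        sigma.append(int(rng.random() < conditional(sigma)))
    return tuple(sigma)

def model_prob(config):
    """Exact (tractable) likelihood: the product of the conditionals."""
    p, prev = 1.0, []
    for s in config:
        p1 = conditional(prev)
        p *= p1 if s == 1 else 1.0 - p1
        prev.append(s)
    return p

def classical_fidelity(p, q):
    """Classical fidelity sum_x sqrt(p(x) q(x)) between two distributions
    over outcomes; equals 1 when they coincide."""
    return float(sum(np.sqrt(p[x] * q[x]) for x in p))

# Exact distribution by enumeration, empirical distribution by sampling.
exact = {c: model_prob(c) for c in itertools.product([0, 1], repeat=N_SPINS)}
counts = {c: 0 for c in exact}
n_samples = 20000
for _ in range(n_samples):
    counts[sample()] += 1
empirical = {c: counts[c] / n_samples for c in exact}
```

Because the conditionals multiply to a normalized distribution, the exact probabilities sum to one, and the empirical histogram converges to it as more samples are added, which is the mechanism behind the fidelity-versus-data curves described above.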
Now we moved on to ground states of local Hamiltonians. That was all for the GHZ state, but we wanted to explore whether we could also use ground states of many-body Hamiltonians. So this is for 50 spins, using DMRG matrix product states, and we find that we can also reconstruct these ground states pretty accurately. The orange curve is the synthetic data, coming from the density matrix renormalization group: the expectation values of sigma z and sigma x as a function of the site. What we see is that the correlation functions of our reconstruction match the synthetic data pretty well. And this is for two-body correlation functions, sigma 1 sigma i along the x direction, and again we see good agreement. Then we have a slightly more complicated model, the Heisenberg model on the triangular lattice, which is interesting in that it has a complicated sign structure: this is a model whose wave function is not known, and we know it has a complicated sign structure, and we wanted to see if we could represent that ground state with only probabilities, which is what we're doing; that's why we picked this example. So this is on an eight-by-eight lattice, these are correlation functions of the synthetic state, namely the ground state of the Heisenberg model on the triangular lattice, and these are the reconstructions, and it seems to be working pretty well. So, to conclude this part: using the RNN was important for the success of the method because of its tractable likelihood and the fact that we can sample it exactly, as I highlighted. If you have any questions, now is a good time to discuss. No questions, so maybe I can go ahead. Okay, so that was it for reconstructing quantum states; now let me tell you about something newer. Oh, there was a question: can you explain why we can get exact sampling? Yeah, so
this question is from Kazuki: how can we get exact sampling? It's because this RNN is constructed by parametrizing all the conditionals in a probability distribution P(sigma 1, sigma 2, ..., sigma n). The model parametrizes each of the conditionals, meaning we parametrize P(sigma 1), then P(sigma 2 | sigma 1), then P(sigma 3 | sigma 1, sigma 2), and so on. If you have a specification of all those conditionals, then you can use the first conditional to get a sample of sigma 1, feed that into the conditional for sigma 2 and sample it, then take sigma 1 and sigma 2 and compute the conditional for sigma 3, and so on until you exhaust all n spins. Since you have access to all those conditionals by construction, the result is an exact sample of the model. That's why you can do it: because of the construction of the model, which I think is very important. Okay, so if there are no other questions... oh, there's one more, from Roberson: how does the measurement come in, in the example of calculating ground states of Hamiltonians? That's a good question, and it comes in the following form. Imagine we're preparing this quantum state in a device, and then we measure this device, this quantum state that is the ground state of the Hamiltonian; we collect the statistics, and then we reconstruct the quantum state. So we basically use the measurement outcomes of those measurement operators to train the RNN model in these examples. I hope that answers the question. Then let me take one more, from Apolinario: can Fisher information be used as an alternative measure for UVM? That I don't know; it could be, but I'll have to think about it. All right, let me go ahead with the next topic. So this is new, from a couple
of weeks ago: variational neural annealing. Let me introduce the idea of combinatorial optimization. Many important challenges in science and technology can be cast as optimization problems, and there are famous optimization problems that motivate this: the traveling salesman problem, the nurse scheduling problem, vehicle routing problems, spacecraft scheduling, circuit design, the discovery of the Higgs boson; all of those, or the data analysis of those experiments, can be recast as an optimization. These are computationally very difficult problems. It turns out that many of them can be formulated as finding the ground state of a classical Ising Hamiltonian that I call H target. This is the expression; the sigma i, sigma j variables are just plus or minus one, and this is extremely general: many, many problems can be cast as finding the ground state of this Hamiltonian, meaning the spin configuration sigma i that minimizes its energy. But finding these solutions is extremely hard for some problems, so there are heuristic methods to do it, and one that I really like is called simulated annealing. What is simulated annealing? It's inspired by an old technique: it mirrors the analogous annealing process in materials science and metallurgy, where a crystalline solid is heated, so you warm up a piece of metal, like a sword, and then you slowly cool it down, and as you cool it down the metal finds its lowest-energy and most stable crystal arrangement, which makes the material really durable and hard; this is actually used when people make weapons and swords. So in the 80s people took inspiration from this metallurgical technique to devise an approach to solve
combinatorial problems of this form, basically finding the ground state of an Ising Hamiltonian. What they did was define simulated annealing (it's not real annealing, you don't heat up your computer or anything, you simulate annealing) and explore the optimization problem's energy landscape through a gradual decrease of thermal fluctuations, where these thermal fluctuations are generated by Monte Carlo, by the Metropolis-Hastings algorithm. Basically, you take this Hamiltonian and simulate it using Monte Carlo, and then you slowly cool down the temperature of the simulation: you decrease the temperature little by little until it reaches zero, and at zero temperature you should be seeing only configurations consistent with the ground state of the target Hamiltonian. That's the idea of simulated annealing; it provides a fundamental connection between thermodynamics and the behavior of physical systems on one side and complex optimization problems on the other, which I find very appealing, a very beautiful algorithm. The problem with simulated annealing is that sampling the Boltzmann distribution using Markov chain Monte Carlo, the Metropolis-Hastings algorithm, becomes very slow for hard optimization problems as you cool down, as you make the temperature very low, because the autocorrelation time of the Markov chain becomes very large. So finding solutions becomes very expensive, because you have to wait for a very long time. This schematic shows what happens. This is a simplex, the space of probability distributions, and you start at very high temperature somewhere here; if you did the annealing at a very, very slow speed, you would follow the path that connects infinite temperature with zero temperature. This path, if you want, is the exact Boltzmann
distribution at every step, and then you solve the problem exactly: for instance, for a degenerate problem you find this configuration or this configuration. That's if you do it very slowly. However, if you do simulated annealing, that is, Markov chain Monte Carlo, the Markov chain goes out of equilibrium, which I represent here, and you get stuck somewhere, ending up with approximate solutions that may be good or bad. That's what happens with simulated annealing. Now, what we're proposing is the following idea: replace annealing while approximately sampling the exact Boltzmann distribution with annealing an approximate distribution that is close to the Boltzmann distribution, such as an RNN, which can be sampled efficiently, with no autocorrelation. That's the idea. This may or may not lead to better solutions; that's why the method is heuristic. But what we find is that the approximation to the Boltzmann distribution we compute with the RNN is a better approximation to the exact Boltzmann distribution, which means you can get better solutions than simulated annealing by using the RNN, thanks to the fact that you can draw exact samples without autocorrelation. This was posted on the arXiv a few weeks ago. Let me show you an example of solving optimization problems with this technique, but before that, let me tell you how we train and optimize this RNN so that it mimics the Boltzmann distribution while we anneal in temperature. We use a time-dependent free energy: the nice thing is that it has an energy part, which is the target Hamiltonian, a time-dependent temperature that allows us to go from high temperature to low temperature, and then the entropy
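In symbols, the cost at annealing time t is the variational free energy F_t = ⟨H target⟩ − T(t) S. As a minimal numerical illustration of this quantity (with a tiny, explicitly enumerable distribution standing in for the RNN, which is my simplification, not the actual setup):

```python
import numpy as np

def free_energy(probs, energies, T):
    """Variational free energy F = <H> - T * S for an explicit
    distribution over configurations. With an RNN one would instead
    estimate F from samples; here we enumerate everything."""
    probs = np.asarray(probs, float)
    energies = np.asarray(energies, float)
    mean_E = float(probs @ energies)
    S = float(-np.sum(probs * np.log(probs + 1e-12)))   # Shannon entropy
    return mean_E - T * S

# Two-configuration example with energies 0 and 1: at high T the flat
# distribution has lower F (entropy dominates); at T = 0 putting all
# weight on the lowest-energy configuration wins.
E = [0.0, 1.0]
```

This is exactly why annealing T(t) from large to zero steers the model from the easy high-temperature distribution toward the ground state.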
that's going to be our cost function, the thing we optimize. We start at a high temperature T0 and use a linear schedule such that as we take t from zero to one we end up at the ground state of the problem, and what we do is optimize this free energy along the way. So what is the algorithm? We perform a warm-up at high temperature, where we make the RNN match the high-temperature distribution; then we take small time steps and retrain the model at each temperature, using the variational parameters from the previous step to initialize the model at the next time step t plus delta t. At the end of the annealing process, the distribution given by the RNN is expected to assign high probability to the configurations that solve the optimization problem. That's the strategy. These are the results; let me highlight figure (a). This is for a spin glass problem with, I think, a hundred spins, the Sherrington-Kirkpatrick model, a fully connected spin glass problem, so it's deemed to be challenging. What we find is that as we increase the number of annealing steps, meaning the time we take from high temperature to low temperature, the residual energy, that is, the excess energy with respect to the exact solution of the problem, goes to very small numbers, around 10 to the minus 6, for variational simulated annealing, our technique, the curve in blue. And it does so at a faster rate than traditional simulated annealing, the red curve here, and even simulated quantum annealing, which is also a powerful method, inspired by quantum annealing, and we find that our technique is significantly better if you use a lot of annealing steps. So this was for the Sherrington-Kirkpatrick model
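The annealing loop just described (linear temperature schedule, retraining at each temperature, warm-starting from the previous step's parameters) can be sketched as follows. Note the hedges: a factorized mean-field Bernoulli ansatz stands in for the RNN, gradients are computed exactly rather than estimated from samples, and all hyperparameters are made up for this toy.

```python
import numpy as np

def variational_annealing(J, T0=2.0, n_temps=40, n_train=200, lr=0.1, seed=0):
    """Toy variational annealing for H = -1/2 sigma^T J sigma.
    At each temperature of a linear schedule we retrain the ansatz to
    minimize F = <H> - T * S, warm-starting from the previous step."""
    rng = np.random.default_rng(seed)
    n = J.shape[0]
    theta = 0.1 * rng.standard_normal(n)          # logits of P(sigma_i = +1)
    for k in range(n_temps + 1):
        T = T0 * (1.0 - k / n_temps)              # linear cooling schedule
        for _ in range(n_train):
            p = 1.0 / (1.0 + np.exp(-theta))      # P(sigma_i = +1)
            m = 2.0 * p - 1.0                     # magnetizations <sigma_i>
            # dF/dp: energy part -2 (J m)_i, entropy part +T * theta_i
            dF_dp = -2.0 * (J @ m) + T * theta
            theta -= lr * dF_dp * p * (1.0 - p)   # chain rule through sigmoid
        # theta carries over: warm start for the next, lower temperature
    return np.where(theta > 0, 1, -1)

# Easy test instance, a ferromagnetic ring: the ground states are the
# two fully aligned spin configurations, with energy -n.
n = 8
J = np.zeros((n, n))
for i in range(n):
    J[i, (i + 1) % n] = J[(i + 1) % n, i] = 1.0
sigma = variational_annealing(J)
```

On this easy instance the annealed ansatz should collapse onto (or near) one of the two aligned ground states; on hard spin glasses, the correlations an RNN can express, which a factorized ansatz cannot, are what make the full method competitive.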
and this is for the so-called Wishart planted ensemble, also a fully connected spin glass; this one is very interesting and very difficult to solve, but we find that if we allow enough annealing steps, our method finds solutions that are orders of magnitude more accurate than simulated annealing as well as simulated quantum annealing. And this is one more example from the Wishart planted ensemble, where we again find better solutions, though not as accurate as in the two other examples. With that, let me conclude and take a few questions. We introduced a formulation of the quantum state that is closer to a statistical theory, because we represented the quantum state in terms of probabilities; we used that representation to reconstruct quantum states of increasingly large sizes, so it provides a way to approximately reconstruct these quantum states; and in the final part we introduced a variational formulation of simulated annealing that produces very accurate solutions to spin glass problems, which may have applications in all these areas. And just to conclude, I have this personal belief that there are a lot of opportunities at the intersection of physics, condensed matter physics, and machine learning. With that, let me take questions. Thank you, Juan. Yes, you have two questions already. First: is it possible to get the slides of these courses? Yeah, I can send my slides to the organizers, and then we will post them on the web page of the event. Okay, and then there's a question by Youssef: could we use variational annealing to construct state ensembles in the case of quantum steering? So, I'm not familiar with quantum steering; can you explain it to me? Maybe we can allow Youssef to speak. Good evening, and thank you, Juan, for the presentation. Quantum steering is a correlation between nonlocality and entanglement, and it's used to secure the
communication, quantum communication. Did you understand me, or do I need to clarify? I think I need to understand what it is that you want to do; I'm not sure I understand. So, in the case of entanglement we have two trusted parties, for example Bob and Alice, but in the case of steering we don't trust one of them: Alice sends her states, and Bob needs to check them and reconstruct the state. Normally we use semidefinite programming to optimize the state that Alice sends to Bob, so could we replace the semidefinite programming by the variational method? Yeah, so I think I now understand. I think that as long as you can formulate this problem as a combinatorial optimization, that is, as finding the ground state of a classical Hamiltonian, then you can use it. Thank you very much. Yeah, the question is whether you can reformulate it that way, and if yes, then you can. Thank you. Okay, there's a question from Giancarlo: can you study metastable states, local minima, and free energy barriers, and characterize the free energy landscape with this variational approach? I think that's a good question, and I think yes, you can try to do that. The problem I see is that the training, the optimization of the free energy, may fail, and there's no easy way to check for that. With my student, we think there's a way to address this issue of mode dropping, which is basically missing some of the local minima, but the method itself doesn't guarantee that you will find all the modes of the distribution or that it's going to explore the entire landscape. But I know people
have tried this style of approach to these problems using machine learning, so I think there is hope, and there's potential to approach this type of problem with these techniques, with annealing. Thanks. Okay, there is also a question in the chat: could you share references and source code for the spin glass solutions with variational annealing? I didn't hear it very well, so let me read it again... could you share references and source? So the reference we have is this one, arXiv 2101...154, and the source code we have not released yet, but we will. However, this is very easy to code: if you go to the hands-on tutorial there's an RNN already there, and this variational simulated annealing is just a loop over different temperatures in which you optimize the free energy, which is easy to compute. So there's plenty of code available to do this, and even in the hands-on tutorial there's already code for it, but we'll eventually release our code for spin glasses. Any more questions? Yes, there is another question in the Q&A. Roberson asks: I'm not sure if I missed it, but how do you verify the results produced by the algorithm, for instance in the spin glass problem? This is a good question. For the spin glass problem, for the Sherrington-Kirkpatrick model, we use the so-called spin glass server: there's a server, I think in Germany, in Jülich, where you give them the problem Hamiltonian and they give you the answer; they use some heuristics, and they either tell you the answer or tell you they cannot solve it, because it's too big or there's no heuristic to do it. So for the Sherrington-Kirkpatrick model we use the spin glass server, and for the Wishart ensemble... the Wishart ensemble is very interesting;
it's a fully connected spin glass problem with so-called planted solutions, and planted solutions means the solution is known to you by construction. It's like finding a needle in a haystack where you know what the needle looks like; it's just that the needle is surrounded by a big bunch of straw, and the energies of the straw configurations are very, very close to the energy of the needle. But you know what the needle is, you know the solution, so the way you verify the solution given by VCA, the variational simulated annealing, is that you know the exact solution; you're just trying to find it in a rough landscape of similar states surrounding it. Thank you. But what if we don't have a known solution? Then you just hope for the best. Many of these problems have no known solution; if we knew the solutions, they wouldn't be optimization problems. For many of these problems, what you do is you try, you find the best energy you can, and you take it. For some of these problems there are bounds for some algorithms, so some algorithms can tell you how far you are from the real ground state, but not all of them; in our case the method doesn't give you any guarantee. Thank you, it was just a curiosity. Okay, if there are no more questions, I think we can finish here; we have plenty of time if people still want to ask, but if not, thank you again, Juan. Yeah, thank you very much, it was a pleasure, and I hope to come in person next time. And that's all. Well, Asia, do you want to say something to conclude this series on machine learning? Yeah, I hope it was very useful for all the participants; it was definitely useful for me, so thanks to Juan and to all the previous speakers and
tutors. Yeah, thanks for having me; it was fun, and hopefully we see each other in the future. I learned a lot, so thank you very much. Thanks to everyone; let's stop here. Bye.