Okay, so welcome back to this second lecture, where we're going to describe more machine learning methods for many-body physics, both quantum and classical. I didn't mention it at the beginning, but my lecture notes are already available on the ICTP website; I think you can access them if you are logged in as a participant of this conference. Tomorrow I'm also going to give you some computer codes, which will be publicly available, for example to simulate some quantum systems using these ideas, so you might want to be here tomorrow morning to have a look at that. In the previous lecture we've seen one of the paradigms of machine learning, which is supervised learning, and we've seen some applications, for example to classifying phases of matter; I've shown you this example before. Now what I want to discuss is another paradigm, called unsupervised learning. So what is unsupervised learning? It's conceptually a bit harder than the previous one. The previous one, I've told you, is basically just fitting, plus a clever method for optimizing. Unsupervised learning deals with the following problem: we again have a lot of data, a data set like the one I wrote before, but now we don't have the labels. So I don't know what the answer to my problem should be on those selected examples. It's not like before, where I already knew that the phases on the left were ordered and the ones on the right were disordered; I don't know that in advance, so I want to find it out by myself, in a sense. This is a much more powerful application of artificial intelligence, if you want. To realize this goal, we need a couple of conceptual and practical tools.
So in particular, unsupervised learning can be used to find the probability distribution according to which those samples are generated. We assume that underlying these samples that somebody has given me, there is some unknown probability distribution π that I want to determine. The goal of unsupervised learning is then to devise, to find, for example, an artificial neural network, which I call F again, such that this probability distribution is well approximated, in a sense I will specify, by my neural network; all of these are just high-dimensional vectors. I imagine, again, that my artificial neural network depends on some parameters p, and here I'm also specifying a normalization, because this has to be a probability density. So if I want this to be a proper probability density, I have to divide everything by its normalization Z(p), which, if x is a discrete variable, is just the sum over all possible values of x of F(x; p). Okay, so how can I do that? Well, like before, I first have to define some function which measures how far we are from the target probability distribution, so how well we are approximating our data. It turns out we cannot use the measure we were using before, basically the L2 distance, simply because we don't know the value of π: we only know samples from π. We want to use only this information, not the value that π takes on those points, because we simply cannot estimate it and we don't know it in advance. The quantity that is used instead is much more meaningful in this case: it's called the Kullback–Leibler divergence, which is a measure of how similar, or how far apart, two probability distributions are.
So, for example, imagine that my target distribution is π and my approximation is F. The Kullback–Leibler divergence is then defined, for a discrete-valued variable like the one I'm using here, as D_KL(π‖F) = Σ_x π(x) log[π(x) / (F(x; p)/Z(p))]: the sum over all possible values of x of my unknown probability distribution π(x), times the log of the exact probability distribution divided by the approximate one, which here is F(x; p) divided by Z(p). So there are two things we should remark about this Kullback–Leibler divergence. First, you see that if the target distribution π is equal to the approximate distribution, then the ratio is equal to one, and the log of one is zero, which means this quantity has its minimum at zero. Second, it is not a proper distance in the metric sense, because if you invert the roles of π and F, you see that the quantity changes; so it's not a proper metric. But still it has a clear interpretation in terms of information theory: it can be interpreted as the amount of information you miss when you represent your data with the approximate probability distribution instead of the original one, so the information you lose in compressing, if you want, your data with another probability distribution. So we want to minimize this quantity and find the minimum, which corresponds to the best possible approximation of my probability distribution. To do that, note that this object depends on the parameters of my neural network.
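The two properties just mentioned can be checked numerically. Here is a minimal sketch (not from the lecture; the two example distributions `pi` and `f` are made up) showing that the KL divergence vanishes when the distributions coincide and that it is not symmetric:

```python
import numpy as np

def kl_divergence(pi, f):
    """D_KL(pi || f) for discrete distributions given as probability arrays."""
    pi = np.asarray(pi, dtype=float)
    f = np.asarray(f, dtype=float)
    mask = pi > 0  # terms with pi(x) = 0 contribute nothing to the sum
    return np.sum(pi[mask] * np.log(pi[mask] / f[mask]))

pi = np.array([0.5, 0.3, 0.2])   # hypothetical target distribution
f  = np.array([0.4, 0.4, 0.2])   # hypothetical approximation

print(kl_divergence(pi, pi))     # zero: the minimum is attained when f = pi
print(kl_divergence(pi, f))      # positive
print(kl_divergence(f, pi))      # differs from the line above: not symmetric
```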
And this will also show you why this quantity is important. If we take the derivative of D_KL with respect to one of my parameters p_k, you can see that this can be written immediately: the only things that depend on p here are F and Z. You get a first term, which is minus the sum over x of π(x) times the derivative with respect to p_k of log F(x; p); this is the term that comes from deriving the denominator. And then you have another term, which is plus the derivative with respect to p_k of log Z(p). But then we can rewrite this as the difference between two expectation values: ∂D_KL/∂p_k = −⟨∂ log F/∂p_k⟩_π + ⟨∂ log F/∂p_k⟩_{F/Z}. The first expectation value is taken with respect to the distribution π: it's the average over π(x) of the derivative of log F, with a minus sign. And then I have a plus term: if you carry out the derivative of log Z(p) explicitly (I'm not going to do it here, but it's very easy), you find basically the same object, the derivative of log F again, but the average is now taken not over π but over my approximate probability distribution, F over Z. So the gradient of this quantity has a very transparent probabilistic meaning: it is zero only when the expectation values of these objects are equal on both distributions. You attain a minimum when all those derivatives have the same expectation value under the two distributions. And it also tells you why we chose this object in the first place, because of what those expectation values look like.
So if we rewrite the first expectation value, it can be written as the sum over all possible values of x of π(x) times this object, which I typically call D_k(x; p). Since π is already normalized, this can be approximated, because of the law of large numbers, as a sum over my N_s samples: (1/N_s) Σ_i D_k(x_i; p), the simple average of this object over my samples, which is a function I can compute very easily on those points. So this means that during my optimization, when I need the gradient, so that I can use stochastic gradient descent as before, I never need to know the actual value of π on those points; I just need the points themselves, and to compute some averages over those points. The other ingredient we need is the second term. The first one is really easy to compute: once you know the artificial neural network, you just take the derivatives and evaluate them on some given points. The second one is a bit trickier, because you have to be able to sample from your neural network. The idea is that you generate some other samples, let's call them x′_1, x′_2, and so on, where these high-dimensional data are generated according to the probability given by the neural network. So in order to compute those expectation values as simple averages over these x′, you need to be able to generate those points; you need to be able to sample efficiently, if you want, from the machine. In order to do so, we need to introduce some practical scheme.
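The first term is exactly the Monte Carlo estimate described above. A small sketch (my own illustration, with a made-up two-valued distribution and a stand-in D_k function) showing that the sample average over data drawn from π approximates ⟨D_k⟩_π without ever evaluating π inside the estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "data set": N_s samples drawn from an unknown pi (here a biased
# coin, but the estimator below never looks at pi -- only at the samples).
samples = rng.choice([-1, +1], size=5000, p=[0.3, 0.7])

def D_k(x):
    # stand-in for d/dp_k log F(x; p): any function we can evaluate per sample
    return np.tanh(0.5 * x)

# Law of large numbers: <D_k>_pi ~= (1/N_s) sum_i D_k(x_i)
estimate = np.mean(D_k(samples))
exact = 0.3 * D_k(-1) + 0.7 * D_k(+1)   # known here only because pi is a toy
print(estimate, exact)                   # close for large N_s
```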
It turns out that the networks I introduced before, in particular these deep feed-forward networks, are not particularly suitable for this purpose, and instead there are others. Yes, a question? The first term here: I don't have the probability π, no, but I have the points, which are distributed according to π. Which means that expectation values over π can be estimated by averaging the function I want to average over those points. This is the law of large numbers; it's the principle of Monte Carlo sampling. I generate a lot of samples from a given distribution and then I compute the average of the function I want over those points. Yes, but I'm never using the distribution explicitly; I'm just using the points, which are drawn implicitly from that distribution. So to compute this sum, I don't need π, that was my point before; I just need the values of the samples, that's it. Okay, so now let's introduce an architecture, a network, which allows me to sample from the network efficiently. What I'm going to talk about are called restricted Boltzmann machines. For many reasons, I believe these are the natural entry point into the world of neural networks if you come from statistical physics, or from physics in general, because, as you can imagine from the name, they have a strong connection with classical statistical thermodynamics. So what's the idea of these machines? I've told you before that we need a high-dimensional function: we have this F(x) everywhere, a function of this high-dimensional vector. So let's assume for a moment that my vector x takes the form of a spin configuration.
For example, you can imagine a binary variable which takes values zero and one, which can be mapped onto a spin variable taking values plus one and minus one, okay? So we call these variables, as you expect, σ_1, σ_2, …, σ_n: this is my high-dimensional vector, which I'm now taking in the form of a spin vector, where each component can take plus or minus one; it's a classical spin one-half. Now, the idea of the RBM, the restricted Boltzmann machine, is to write this high-dimensional function, this network if you want, as the partition function of a classical object. In particular, it's written as a sum over some auxiliary spins, which are called hidden units (I'm going to tell you more about these in a second), of a Boltzmann weight: F(σ) = Σ_h exp(Σ_ij σ_i W_ij h_j + Σ_j b_j h_j + Σ_i a_i σ_i). Inside the exponential we have a classical interaction energy: an interaction term, analogous to the network weights I was discussing before, which couples my physical spins to those artificial spins, the hidden variables. Then we also have bias terms, both for the hidden spins, the h variables, and for the physical variables, the σ_i a_i term. So let's see what this means. In this partition function, the parameters to be determined are the following: first the interaction matrix, the weights W; then the biases b for the hidden units; and then the biases, the fields if you want, for the physical spin variables σ. In general, what we do is take a fixed number of hidden variables: for example, m hidden variables h_j, which take values plus or minus one, so they are themselves spins.
Okay, and m is also a parameter that I can tune, which in a sense corresponds to how clever my network, my artificial brain, is. You can imagine that the more of those artificial neurons I have, the more complex my network will be, because I will have more connections, and the more accurate the approximation of the functions I want to represent will be, right? There are also representation theorems for those machines. Now, you can represent this graphically: those are your σ spins, say σ_1, σ_2, σ_3, σ_4, and then you have a couple of hidden units, in this case just two, h_1 and h_2. And these hidden units, as you can see here, interact with all the physical spins in this way; these are classical spins. So why are these machines called restricted Boltzmann machines? They are called restricted because, as you can see, we don't have any direct interaction between the physical spins: the only interactions allowed are between the σ and the h. We don't have direct connections among the σ's, and likewise none among the h's. The reason why this is the case will become apparent in a moment. Indeed, since we don't have any interactions among the h's, we can perform this sum explicitly, because we can factorize this object: we can write it as the exponential of Σ_i σ_i a_i, which is just a common factor that does not depend on h, times the sum over all possible values of h of the product over j (where j, again, is the index of the hidden unit) of the exponential of Σ_i W_ij σ_i h_j + h_j b_j. I can do that by just manipulating the expression, and then I can sum over each h_j individually, because these are non-interacting; it's basically, if you want, a mean-field model in those variables.
So at the end, you can write this function explicitly as the exponential of that first term times a product: for each h_j we have to sum the two possible values, so we get the exponential of something plus the exponential of minus that something, which is twice the hyperbolic cosine of the argument. So F(σ) = exp(Σ_i a_i σ_i) Π_j 2 cosh(Σ_i W_ij σ_i + b_j), okay? We've been able to perform, analytically, a summation that in principle runs over an exponentially large space of hidden configurations, thanks to the structure of those connections. Now we have this structure: we know the explicit form of F and we want to use it to generate samples which are distributed according to this network. So how do we do that? How many of you know about the Metropolis–Hastings algorithm? Now, let me ask the question again: how many of you know about the Metropolis method? Okay, that's better. So the idea is that we have a probability density, an unnormalized probability density at least, which has an explicit form: this one. In principle, what we can do is devise a Markov chain. Unfortunately I don't have time to explain how this works in practice, so let's assume you already know what a Markov chain is and how it works. It generates a chain of samples: a stochastic process which starts, let's say, from a given many-body configuration σ_1, and then transits to a sequence of other configurations, σ_2, et cetera, through some transition probability, which I call T(σ_1 → σ_2), the probability of going from one element to the next. So this is a stochastic process.
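The traced-out formula can be verified directly against the brute-force sum over hidden configurations. A minimal sketch (my own toy example, with small random weights and tiny sizes n = 4, m = 2 so the explicit sum over 2^m hidden states is cheap):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n, m = 4, 2                          # visible and hidden spins (toy sizes)
W = rng.normal(scale=0.3, size=(n, m))
a = rng.normal(scale=0.1, size=n)    # visible biases
b = rng.normal(scale=0.1, size=m)    # hidden biases

def F_analytic(sigma):
    # F(sigma) = exp(sum_i a_i sigma_i) * prod_j 2 cosh(sum_i W_ij sigma_i + b_j)
    theta = sigma @ W + b
    return np.exp(a @ sigma) * np.prod(2.0 * np.cosh(theta))

def F_bruteforce(sigma):
    # explicit sum over all 2^m hidden configurations h in {-1,+1}^m
    total = 0.0
    for h in product([-1, +1], repeat=m):
        h = np.array(h)
        total += np.exp(sigma @ W @ h + b @ h + a @ sigma)
    return total

sigma = np.array([1, -1, 1, 1])
print(F_analytic(sigma), F_bruteforce(sigma))   # the two numbers agree
```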
So with some probability, given σ_1, I generate stochastically one possible σ_2, and then given σ_2 I go on and advance my Markov chain. This will give a way to generate samples from my probability distribution, provided I satisfy the so-called detailed balance condition, right? And this is where the Metropolis–Hastings approach comes in. In particular, you can show that the configurations you sample are distributed according to the original probability distribution, let's call it π(σ), if you accept the next configuration with a probability A(σ → σ′) = min[1, (π(σ′)/π(σ)) × (T(σ′ → σ)/T(σ → σ′))]. Here A is the probability of accepting a transition from σ to σ′ (we are here and we want to go there, for example), π(σ′)/π(σ) is the ratio of the probability distribution evaluated on the new and old configurations, and the T factors are the transition probabilities of proposing the reverse move, from σ′ to σ, and the forward move, from σ to σ′. This is the Metropolis–Hastings criterion. So in practice, to sample from this machine, one simple thing you can do is the following: you take your current configuration of spins, n variables which can take plus or minus one, you pick one of them at random, flip it, and compute that object, F, on the new flipped configuration. In this case the proposal is symmetric, so you just compute the ratio of the two probabilities and decide whether to accept this change in the configuration according to that probability, right?
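The single-spin-flip scheme just described can be sketched in a few lines. This is my own illustration (toy sizes, random parameters), using the traced-out log F of the RBM, with the symmetric proposal so the acceptance ratio reduces to min(1, F(σ′)/F(σ)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 3
W = rng.normal(scale=0.2, size=(n, m))
a = rng.normal(scale=0.1, size=n)
b = rng.normal(scale=0.1, size=m)

def log_F(sigma):
    # log of the unnormalized RBM probability, hidden units traced out:
    # log F = a.sigma + sum_j log(2 cosh(theta_j)), computed stably
    theta = sigma @ W + b
    return a @ sigma + np.sum(np.logaddexp(theta, -theta))

def metropolis_step(sigma):
    """One single-spin-flip Metropolis move; the proposal is symmetric, so the
    acceptance probability is just min(1, F(sigma')/F(sigma))."""
    i = rng.integers(n)
    proposal = sigma.copy()
    proposal[i] *= -1
    if np.log(rng.random()) < log_F(proposal) - log_F(sigma):
        return proposal          # accept the flip
    return sigma                 # reject: keep the old configuration

sigma = rng.choice([-1, 1], size=n)
for _ in range(1000):
    sigma = metropolis_step(sigma)
print(sigma)                     # a sample (approximately) from F/Z
```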
So you might have used this approach, for example, to sample from the classical partition function of the Ising model; that's something people usually see. Okay, so this is one possibility. The other possibility, which is more popular in this field of restricted Boltzmann machines, is not Metropolis sampling but Gibbs sampling. Gibbs sampling is a different idea, which is particularly useful in this case. The idea of Gibbs sampling is that instead of considering just this probability distribution as a function of σ (and I assume now that the weights are real numbers, so that this object is a genuine probability distribution, as it should be), I sample from a joint space: my configurations x, the ones I want to sample, will now be the ensemble of the σ and the h variables together. In particular, Gibbs sampling takes transition probabilities that drive me to a different configuration x′, where the configuration of σ or of h is changed. Typically we do this with two types of moves. The first type of move is one where we only change σ and leave h unchanged; the other type is one where we fix σ and sample only h. These are the two possibilities. In the first case, where we sample σ at fixed h, we use as the transition probability from x to x′ the conditional probability of σ given h.
So in this case, what I use is, sorry, the probability of σ′ given h, which is defined, if we normalize everything correctly, as P(σ′ | h) = F(σ′, h) / Σ_{σ″} F(σ″, h): the quantity I've written over there, divided by the sum over all possible visible configurations at fixed h. The whole point is that for this restricted Boltzmann machine architecture you can compute this quantity exactly, and you can do the sampling using this transition probability. The big advantage of this Gibbs sampling strategy is that at the end you don't have to perform an extra acceptance step, as you do in Metropolis sampling: you accept all the moves you generate. That is the great advantage. You can find all the details in the notes, so I don't want to go too much into them now, but the idea, graphically, is the following. You have your set of spin variables and your set of hidden units, okay? In the first step you fix the hidden units, so you don't change them, and you sample the spins: you generate values for all of them at once using this conditional probability, which you can determine and easily compute. It's like a multi-spin move, so it's particularly efficient. Then you have the other move, where you freeze the spins and sample the hidden units, and you go back and forth, sampling from this machine in this way. Once you've done that, you can also measure the second term we had in the KL divergence, the expectation value over my approximate probability distribution.
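The back-and-forth block updates can be sketched as follows. This is my own illustration (toy sizes, random parameters); for ±1 spins the conditionals of the RBM factorize per unit, with P(h_j = +1 | σ) = sigmoid(2(Σ_i σ_i W_ij + b_j)) and symmetrically for the visible layer, so every move is accepted:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 6, 3
W = rng.normal(scale=0.2, size=(n, m))
a = rng.normal(scale=0.1, size=n)
b = rng.normal(scale=0.1, size=m)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_sigma(sigma):
    # for +/-1 spins: P(h_j = +1 | sigma) = sigmoid(2 (sigma . W_:j + b_j))
    p = sigmoid(2.0 * (sigma @ W + b))
    return np.where(rng.random(m) < p, 1, -1)

def sample_sigma_given_h(h):
    # P(sigma_i = +1 | h) = sigmoid(2 (W_i: . h + a_i)); all spins at once
    p = sigmoid(2.0 * (W @ h + a))
    return np.where(rng.random(n) < p, 1, -1)

# alternate block updates: no acceptance step, every move is kept
sigma = rng.choice([-1, 1], size=n)
for _ in range(100):
    h = sample_h_given_sigma(sigma)
    sigma = sample_sigma_given_h(h)
print(sigma, h)
```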
So over my neural network, which in this case is this restricted Boltzmann machine. You are not obliged to do this; you can just do Metropolis if you like. Yes? Yeah, the historical reason is that people wanted to do Gibbs sampling. If you have a different architecture where, for example, you have interactions among the visible units, you cannot do Gibbs sampling, but you can still do Metropolis, yes. The important thing, however, is that you do not want to have interactions between the hidden units: if you have those, you cannot trace them out analytically as I did here. That is the one important constraint. Okay, so now I have all this nice machinery. I know how to sample from my machine, and I know, for example, how to compute the quantities that entered before. I told you that I needed to compute these D_k, which are the derivatives of the log of my function with respect to my parameters, and which in this case depend on σ. We can compute them explicitly; let me write down a couple. If you derive with respect to a visible bias a_i, you just get σ_i; that's very easy. If you derive with respect to a weight W_ij, you get σ_i times the hyperbolic tangent of Σ_i W_ij σ_i + b_j. And so on: you can take the derivatives of the log of that quantity with respect to all my parameters and obtain all these quantities analytically, okay? So I can then use all this machinery to compute the second term I had in the Kullback–Leibler divergence. I remind you that the gradient of this KL divergence was given by the expectation value
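Those analytic derivatives of log F can be checked against a finite difference. A small sketch (my own toy example, random parameters) verifying ∂ log F/∂a_i = σ_i, ∂ log F/∂b_j = tanh(θ_j), and ∂ log F/∂W_ij = σ_i tanh(θ_j), with θ_j = Σ_i W_ij σ_i + b_j:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 4, 2
W = rng.normal(scale=0.3, size=(n, m))
a = rng.normal(scale=0.1, size=n)
b = rng.normal(scale=0.1, size=m)

def log_F(sigma, W, a, b):
    theta = sigma @ W + b
    return a @ sigma + np.sum(np.log(2.0 * np.cosh(theta)))

sigma = np.array([1, -1, 1, -1])
theta = sigma @ W + b

# analytic derivatives of log F with respect to each parameter family
grad_a = sigma.astype(float)                 # d log F / d a_i = sigma_i
grad_b = np.tanh(theta)                      # d log F / d b_j = tanh(theta_j)
grad_W = np.outer(sigma, np.tanh(theta))     # d log F / d W_ij = sigma_i tanh(theta_j)

# finite-difference check of one weight derivative
eps = 1e-6
W2 = W.copy()
W2[0, 1] += eps
fd = (log_F(sigma, W2, a, b) - log_F(sigma, W, a, b)) / eps
print(grad_W[0, 1], fd)                      # the two values agree closely
```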
of this D_k(x; p) over my approximate distribution, the RBM, minus the expectation value of the same object over the true probability distribution π. And again, I can use my Markov chain, so the samples I've generated from my RBM, to estimate the first expectation value as a simple average over those samples, right? So, as before, I can use stochastic gradient descent to perform the optimization of this object. And here it is even more interesting, I believe, because what people do is define a batch of N_b elements, smaller than the total number of samples, for example to compute the expectation value over π; and typically one takes the same number of elements to sample from the machine, so I generate the same number of points distributed according to my approximate probability distribution, and I estimate the difference, okay? And one of the most remarkable things is that this approach is particularly effective even for N_b as small as one or two. This means you can take at random just one or two samples and estimate the gradient in a way which will still drive you towards the minimum of the function. This approach, which is extremely noisy in theory but works very well in practice, is known as contrastive divergence. With N_b equal to one, for example, you have a strongly biased estimate of the gradient, but this estimate will still allow you to converge relatively quickly to the minimum of the KL divergence, okay? So at the end of all this, you have obtained the best parameters of your network to approximate this unknown probability distribution. So the question is: is this useful for doing something, for example, in physics or in other fields?
So, of course, yes, it is extremely useful. For example, you can use it to forge handwriting, which is a rather fun application. Imagine that my data set is a lot of things I've written: I've written a lot of text, and my x's are images of my handwriting on a lot of text. Here, for example, we have examples of what three different people have written, and you can imagine that for each person we have a collection of letters and written pages. This is my large data set, okay? And you can imagine that each of those persons has a different π: for example, I have a certain probability of writing a four the way I normally do. I do not always write it exactly like that; sometimes I write it this way or that way, so there's a probability for me to write a certain letter in a certain way, right? And this probability, of course, is a mess. We don't know it in advance; we can only estimate it through techniques like these. What you can do then is learn this probability, which is crazy, and write whatever you like in that person's hand. You can have this person write, for example, that they are giving a lecture in Trieste today, when they probably don't even know where Trieste is. Or you can write that they owe you a lot of money, and this kind of stuff. So it's also potentially dangerous; you have to be careful. And you can go, for example, to a website where they show you other interactive examples, where you can type your own text and it will be generated in that handwriting style. This is a general application. The other application is, for example, learning thermodynamics, which is an application to physics again.
So I always switch back and forth between, let's say, fancier applications like forging handwriting and more physical applications. In this case, assume you have a classical system from which you can measure, for example, spin configurations: a classical system of spins whose configurations you can observe as a function of time in a lab. So I have my two-dimensional, or n-dimensional, model of classical spins and I can observe a lot of snapshots; these, again, form my data set. We know from Boltzmann that those configurations are distributed according to the exponential of minus β times the energy of the configuration x, divided again by some normalization. However, you can imagine that you are in a lab but you don't know what the energy is, and you want to find the energy function that describes this experiment. What you can do then is just unsupervised learning: you take those samples and use a Boltzmann machine as your model to reconstruct this unknown probability distribution, and from that you can read off, or compute, the energy on an arbitrary configuration that was never observed in the experiment, okay? This is the idea of reconstructing thermodynamics from the lab, if you want. Now, the question you might ask is: okay, but what kind of interactions can I describe with my RBM, which is a very simple machine? We don't even have interactions between the spins; how can you think you can describe complex interactions? Well, it turns out, and I can give you a theorem, that any physical classical model with k-body interactions, where k is one, two, three, and so on, can be described efficiently by a Boltzmann machine.
By efficiently, I mean with a number of parameters, so a number of hidden units, which scales only polynomially with the size of the system, with the number of spins. To give you a rough idea of this, imagine that you want to describe an interaction term. I told you that, by construction, we don't have any interactions between the spins σ_1 and σ_2, but imagine that for some reason my Hamiltonian contains an interaction of the form J σ_1 σ_2. How can I describe this in the form of a Boltzmann machine? Now we are talking about the representability of those interactions, right? The idea is that we can mediate this interaction through the addition of an extra hidden unit, which also gives a very nice interpretation, if you want, of these hidden units: they can be used a little bit à la Hubbard–Stratonovich, if you know that trick. So we insert one auxiliary hidden unit h which mediates the interaction: we say that exp(J σ_1 σ_2) equals, up to a constant, a Boltzmann machine term with an interaction w_1 σ_1 h between the spin σ_1 and the hidden unit, and another term w_2 σ_2 h. Then this h takes just plus or minus one, so I can sum it out again using the hyperbolic cosine and all of that. And basically, considering the possible values of σ_1 and σ_2, four combinations in total, you can solve the resulting system of equations for the unknowns w_1 and w_2. There is a solution, so you can reproduce this interaction exactly using this extra hidden unit. This is the role of the hidden units: they mediate correlations, they mediate interactions among the physical spins.
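The little system of equations can be solved in closed form for a ferromagnetic coupling. A sketch (my own worked example; the symmetric choice w_1 = w_2 = w assumes J ≥ 0, for J < 0 one takes w_1 = −w_2) verifying that summing out one hidden unit reproduces exp(J σ_1 σ_2) on all four spin configurations:

```python
import numpy as np
from itertools import product

J = 0.7    # two-body coupling J * sigma1 * sigma2 (ferromagnetic, J >= 0)

# Choose w1 = w2 = w so that e^{J s1 s2} = C * 2 cosh(w (s1 + s2)):
#   s1 s2 = +1  ->  e^{ J} = C * 2 cosh(2w)
#   s1 s2 = -1  ->  e^{-J} = C * 2 cosh(0) = 2C
w = 0.5 * np.arccosh(np.exp(2 * J))
C = np.exp(-J) / 2

for s1, s2 in product([-1, 1], repeat=2):
    lhs = np.exp(J * s1 * s2)                 # the target interaction
    rhs = C * 2 * np.cosh(w * (s1 + s2))      # hidden unit summed out
    print(s1, s2, lhs, rhs)                   # equal in all four cases
```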
In particular, this tells you that if you have, for example, the short-range Ising model, you have to use a number of hidden units equal to the number of bonds, the number of interactions you have in your problem, which in this case is of order N, the number of spins. And this is what has been shown in these papers: they showed that when you reconstruct, for example, the thermodynamics of the Ising model with a number of hidden units (which in these papers is called m) equal to the number of spins, you basically manage to reconstruct all thermodynamic quantities efficiently; they showed this numerically as well. So let me just mention very quickly our last application, which is to quantum physics. In principle you can follow the same reasoning, but in the quantum case. There, of course, you don't want to determine an energy, because there is no classical energy; what you want to determine is, for example, the wave function of the system. So imagine that I now have quantum spins, or cold atoms if you want, in a lab, and I measure the density of these cold atoms, of my quantum spins. When I do a measurement on my system, what I get, for example from these images that come from actual cold-atom experiments where each of those points is just a single atomic density, is |ψ|². Those spin values are distributed according to the squared modulus of the wave function, by the measurement process of quantum mechanics. So what we showed is that we can use our machinery to train a network to learn |ψ|², basically, in some given basis, okay?
So once you can learn this probability efficiently, also on very large systems, you can use this machine to reconstruct, for example, observables that you were not able to measure directly in the experiment. In a sense, what we can do is forge quantum measurements that have never been performed in the experiment: we use the information we get from the experiment, the wave function we reconstruct from it, to measure other things. For example, we can measure the entanglement entropy and other quantities that are very hard to access in actual experiments. In this paper, we have shown that we can perform an effective in silico reconstruction of the Rényi entropy and other quantities. And the idea, of course, is that it's not sufficient to learn just one basis: if you want to reconstruct the phase, you also have to look at different bases, but this is a technical detail if you want. Okay, so this is the first application to quantum physics that I discuss, and tomorrow morning I will discuss a lot more. So thank you, and I'll see you tomorrow.