But let me first warmly thank all of them, especially for the effort they put into preparing the program of the conference, which took a lot of time. I also imposed a lot of constraints, and they helped me with that. So during these lectures we are going to see a number of very recent applications of machine learning techniques to many-body physics. We'll start with, let's say, classical many-body physics, and then we will dive into more quantum applications, especially tomorrow morning. Sorry about that, we have to wake up early tomorrow morning. What we are going to do today is basically introduce the field: introduce the basics of machine learning and show you a little bit what the potential of these techniques is. Concerning the learning material, which you can probably see from here: the basic material is covered in a lot of good books. The one I particularly like and suggest you read is the book on deep learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, which also has an initial part where machine learning is nicely explained. There are a lot of other places where you can learn about these topics, but this one is particularly nice. There are also a lot of popular software packages that you can use to start working with machine learning, most notably TensorFlow, developed at Google; Theano and Keras are also pretty nice. The applications I am going to discuss are all more or less very recent. Here is a partial list of the things I am going to discuss as applications to many-body and quantum many-body physics. A few of them are from our group; others are from Roger Melko's group at Perimeter and the University of Waterloo. So what is machine learning? I'm pretty sure that most of you have already heard what machine learning is, or know a little bit about it.
But let me tell you first of all that, even if you don't know about it, you have certainly interacted with machine learning applications, often without knowing it, many times in your life. For example, Facebook, Google, and many other online applications such as Google Translate are all based on advanced machine learning techniques, and most of them have become increasingly powerful in recent years, mostly thanks to deep learning, which is one of the major advances in the field. But before going into those details, and before understanding how we can use these technologies in science, let's start from the very beginning. And actually you might be surprised to learn that one of the godfathers of machine learning is arguably David Hilbert. I'm sure you know David Hilbert because of the Hilbert space and its application in quantum physics, but David Hilbert also has a nice connection to machine learning. At the beginning of the last century, he published a list of 23 problems, the hardest open problems of the time. Among these, Hilbert's 13th problem goes as follows. Imagine you have a polynomial of seventh degree, for example this one, and you want to find the solutions of this equation. It is known from the works of Abel, Galois, and others that for a general polynomial of degree five or higher, you cannot write the solutions in a simple algebraic form, in terms of radicals. So the question David Hilbert was asking is whether the solution of this equation can be written as a composition of simple functions. For example, he was asking whether the solution, which is a high-dimensional function of the coefficients a, b, and c, can be written as a composition of easier functions, most notably functions of fewer variables.
So this question, which is rather abstract and seemingly has nothing to do with concrete applications, is actually pretty interesting and had a lot of consequences. A few years later, two other important personalities, Kolmogorov and Arnold, proved what I believe is one of the most beautiful theorems in mathematics, but that's a personal taste. The theorem, in one of its refinements, but the basic idea is the same, goes as follows. Imagine you have a high-dimensional function, which depends on a lot of variables, and assume this function is continuous and so on. For example, you can imagine that this function is the solution of that seventh-degree equation, as a function of its coefficients. What they managed to show, in the refinements developed in the 60s, is that this function can be written as a finite composition of other, much simpler functions: an outer function Phi, applied to a weighted sum of an inner one-dimensional function phi evaluated at x_p plus eta times q, with weights lambda_p and a shift q, summed over the index q. So what do we have here? On the left-hand side we have a complicated high-dimensional function of n variables, a nightmare, something which is typically very hard to approximate. And this theorem states that you can represent this very complicated high-dimensional function with just the composition of two one-dimensional functions: the function Phi, one-dimensional, and the other function phi, which moreover is bounded between 0 and 1. This is what they managed to show.
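To pin down the formula described verbally above, here is my reconstruction of the representation on the board, following Sprecher's refinement of the Kolmogorov–Arnold theorem (the symbols Phi, phi, lambda_p, eta, and q are the ones just mentioned):

```latex
f(x_1,\dots,x_n) \;=\; \sum_{q=0}^{2n} \Phi\!\left( \sum_{p=1}^{n} \lambda_p\, \phi\big(x_p + \eta\, q\big) + q \right)
```

Note that only two one-dimensional functions, Phi and phi, appear, together with the parameters lambda_p and eta.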
So both of these functions, which are the unknowns, if you want, that you have to determine, are continuous functions of their argument, and they have very nice regularity properties; they are Lipschitz, among other things. What this means is that you can represent any continuous high-dimensional function with just two one-dimensional functions and a bunch of parameters. Those parameters are, in particular, the lambdas and this eta; these are the things you have to determine. So again, this is very nice, and I believe it is particularly powerful because it somehow demystifies the complexity of a high-dimensional function: a high-dimensional function can be written as a composition of very simple one-dimensional functions, and only a small number of them. Now, what does this have to do with Google Translate and the other applications I was mentioning at the beginning? Well, the nice thing about those applications is that in general you can see them as high-dimensional functions. For example, imagine you have a picture, and this picture has been digitized, so on each pixel you basically have a number; let's call it x_i, and you can take it between 0 and 1. It is basically the intensity of the light on that particular pixel. This way of representing images is pretty general, at least for black-and-white images. So now let's say you have this image and, as an application of your machine learning technique, you want to know whether this image contains a person you know. You would like to devise a complicated machine, if you want, that takes this picture and tells you with what probability a person you know, whom you have somehow shown to the machine before, is in this picture or not. So you want this algorithm to output some probability p.
This function takes the bits of the picture, this complicated high-dimensional object, and returns exactly this number, a number between 0 and 1. So this is an example of an advanced machine learning application that can be mapped exactly onto a conceptually simple high-dimensional function. The same is true for Google Translate: there you have a high-dimensional function that takes as input a string in Chinese, a high-dimensional object, and outputs the same string in Italian, say. So in all of those applications you are typically dealing with a high-dimensional function that does some job you want done. Now it turns out that the goal of machine learning, in most cases, is to find the best high-dimensional function that realizes some specific, possibly complex task you have in mind: driving a car, or solving Schrödinger's equation, or other things you would like to do. In order to do that we need, of course, other ingredients; it is not enough to know this important theorem by Kolmogorov and Arnold. But this theorem at least gives us an important hope: no matter how complicated this function is, and you can imagine that devising a function able to drive a car must be really hard, you can still hope to find an exact expression, or an approximation, of this function just in terms of much simpler one-dimensional functions. So this is an important message of hope: at least we can try to do something much simpler than working directly with this nightmare of a function. And the idea of machine learning is exactly to find representations of these unknown functions in terms of simpler, in a sense one-dimensional, objects.
Now, as far as I know, this is valid for finite-dimensional spaces; I am not aware of generalizations to infinite dimensions, but maybe there are some. The important thing is that the function is continuous; if it is not, then it is problematic. Now, the fundamental objects we are going to deal with during these lectures, the high-dimensional functions we want to determine, won't be of exactly this form. Historically, people have not taken this, let's say, formal form for the high-dimensional function, but another form that goes under the name of artificial neural networks. Artificial neural networks are nothing but high-dimensional functions of some high-dimensional input; it is just another way of writing those things, but in a slightly different way. So let's see how we can do that, and let's use some graphical notation, which might be helpful. We can draw, for example, my input, my x variables if you want, as just a bunch of points like that. Let's say I have a four-dimensional space, and these x's can be continuous variables, for example. Now, what an artificial neural network, for example a feed-forward neural network, typically does is take this input and form some linear combination of these input variables, with some weights w_ij that I have to determine, plus optionally some constant term which is called, in the jargon, the bias term. Those w's are the so-called weights of the neural network.
So after we take this linear combination, which is somehow reminiscent of the linear combination we were taking inside that theorem, we apply a highly nonlinear function, typically a function phi inspired by biology, to this linear combination, and define another set of variables that I call y_j. So you see what I am doing here: I am applying a matrix W to the initial state, which is a vector of variables x, and the dimension of the index j is arbitrary; I can choose it freely, and it will effectively determine the structure of my network. In this case I form these other variables y_j: for example, I can have y_1, y_2, y_3, y_4. They do not need to be as many as the first set of variables. This object is typically called a layer of the neural network, and these variables are sometimes called hidden variables. So I form this thing and, as I was saying, the motivation for doing so is biological, especially in the choice of those phi's: typically you take phi to be, for example, the hyperbolic tangent of x, or some other highly nonlinear function. Sometimes you take the so-called logistic function, 1 over 1 plus the exponential of minus x. All of these functions are nonlinear functions of the argument x that saturate to some constant value at both plus and minus infinity. The idea, which comes from biology, is that you can see those variables as neurons: when a neuron receives a signal, it can start spiking or not. It emits a signal only if the combination of the incoming signals, which come from all the other neurons, is larger than a certain threshold, okay? So the idea is that this unit activates only if its input is larger than a certain threshold, and this is why these functions have this form.
Okay, and then you can play this game many times and form a lot of layers. You can form another layer of z variables, z_1, z_2, z_3, and so on, until you reach the final layer, where typically, in this simple example, you will have only a single output variable for your network. The basic idea is therefore that you take your initial values of the variables, for example the intensities of the light in your image, and these variables go through a lot of layers where the information is elaborated through a lot of transformations, which eventually output your result. You can see that this thing can be potentially very complex and can, in principle, output basically any meaningful result you want. And the mathematical basis for this to work is that theorem: the fact that you can approximate any high-dimensional function as a composition of simpler one-dimensional functions. Because in the end, you can see that the function represented by this neural network, which I call f of x, can be written as something that depends on these weights. The weights are those objects, for example w_11, w_12; the connections I draw here represent those couplings in the matrix. So basically, in the end, all these functions will depend on a set of weights and a bunch of biases, which I call collectively W and b, and the whole thing is a function of another function, of another function, et cetera. So you can see a deep feed-forward neural network as a composition of simpler one-dimensional functions. Okay? So these were most of the preliminary notions we needed to start doing machine learning. We need a machine, and artificial neural networks are the machine I am going to use.
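As a concrete sketch of the feed-forward construction just described, here is a minimal network in plain Python with tanh activations. The layer sizes (4 inputs, two hidden layers, one output) and the random weights are made up for illustration; a real network would have trained weights:

```python
import math
import random

def dense_layer(x, weights, biases):
    """One layer: y_j = tanh(sum_i w_ji * x_i + b_j)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def feed_forward(x, layers):
    """Apply a sequence of (weights, biases) layers to the input vector x."""
    for weights, biases in layers:
        x = dense_layer(x, weights, biases)
    return x

def random_layer(n_in, n_out):
    """Untrained layer with random weights, just to show the structure."""
    weights = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [random.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

random.seed(0)
# A 4 -> 4 -> 3 -> 1 network, like the drawing on the board.
layers = [random_layer(4, 4), random_layer(4, 3), random_layer(3, 1)]
output = feed_forward([0.1, 0.5, 0.9, 0.3], layers)
print(output)  # a single number between -1 and 1, since the last layer has one tanh neuron
```

The whole network is literally a composition f(x) = phi(W3 phi(W2 phi(W1 x + b1) + b2) + b3), which is the "function of a function of a function" mentioned above.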
The next step is to see how I can use this machine to do useful stuff, right? I have this function, which depends on those weights, those connections, those biases, and the problem now is to determine those coefficients in some way. There are different ways of doing that, which correspond to different philosophies in the realm of machine learning; we have basically three of them. The first philosophy, which I am going to discuss in this first lecture, is called supervised learning. Supervised learning is the conceptually easiest way of determining those coefficients, if you want, and of finding the best approximation of the function that solves some complicated task, okay? The basic philosophy of these approaches is that you have a lot of data; most of these things are data-driven, and you need a lot of data to work with these techniques. In particular, the data you assume you have in this case is a bunch of points: let me call them x^(1), x^(2), ..., x^(Ns), where each of these x's is a vector. For example, x^(1) is some specific configuration of my multi-dimensional variables, and so on and so forth. So we have a lot of these high-dimensional inputs that have been generated, and typically Ns, the number of samples in this data set, is much larger than one. And together with these data, these high-dimensional configurations, we also know what I call y^(1), y^(2), ..., y^(Ns). What are these y's? Each is basically the ideal value I expect from the function I want to reconstruct, okay? For example, for Google Translate, the ideal function F_Google might be something that takes the word "the", right?
And translates it into Italian, so in this case the output might be "il", unless the context calls for something different. So in this case you can think of "the" as my x vector, the collection of those letters if you want, and y would be the expected output of the machine on some known example. So F_ideal on those x^(i) is equal to y^(i), okay? In other words, you assume you have a collection of known examples for which you already know the solution of your problem, translating or doing other things, and on those you know the value of this ideal function. Typically those y's are called labels in the community, because somehow they label the examples you have, okay? Then the problem of machine learning in this case is pretty simple, at least conceptually: you have a machine, for example an artificial neural network, which depends on some variable x and parametrically on some parameters, say p_1 up to p_Np, and you want this function to be as close as possible to the ideal function that you know on some pre-specified points. This should already tell you that this is basically a problem of function fitting, right? You want to determine these parameters by varying p so that the machine's output is as close as possible to those y's. In particular, you can define a simple loss function, which depends on those parameters: for example, 1 over Ns, the number of samples, times the sum from i equal 1 to Ns of the norm of the difference between f of x^(i), which itself depends on all the parameters, and the expected value y^(i). This norm can be, for example, the L2 norm, right?
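The loss just described can be sketched in a few lines. The model here is a hypothetical toy (a one-parameter-per-weight linear "machine"), chosen only to make the formula L(p) = (1/Ns) * sum_i ||f(x_i; p) - y_i||^2 concrete:

```python
def squared_loss(f, params, xs, ys):
    """L(p) = (1/Ns) * sum_i || f(x_i; p) - y_i ||^2  (L2 norm squared)."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = f(x, params)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / len(xs)

# Toy "machine": f(x; p) = [p0 * x0 + p1], a stand-in for a neural network.
def linear_model(x, params):
    return [params[0] * x[0] + params[1]]

xs = [[0.0], [1.0], [2.0]]
ys = [[1.0], [3.0], [5.0]]  # labels generated by the ideal function y = 2x + 1
print(squared_loss(linear_model, [2.0, 1.0], xs, ys))  # perfect fit -> 0.0
print(squared_loss(linear_model, [0.0, 0.0], xs, ys))  # bad parameters -> positive loss
```

When f reproduces the labels exactly, the loss vanishes, which is the criterion used below to drive the fit.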
So in the ideal case in which f is exactly what I expect, this thing is clearly zero. What I can do now is devise a learning algorithm that solves this problem and finds the parameter values that minimize this loss function. In this case, learning is just fitting; this is also just to demystify a little bit the language often used in this community. The important difference, though, which I think is particularly nice and which we can also interpret from a physical point of view, is the way this fitting is done, which is highly non-trivial. The way you typically solve this problem is with what people call stochastic gradient descent, okay? Notice the word "stochastic" in stochastic gradient descent. You know that, in general, if you have a high-dimensional function and you want to find its minimum, the easiest thing you can do is use the gradient descent approach, right? You compute the gradient of this function with respect to the parameters, and then you do an iterative update: you have your set of parameters at step k, and you say that the parameters at step k plus 1 are equal to the parameters at step k, minus some small parameter, which I call eta, times the gradient of the function; in this case, the gradient with respect to p of my loss function L(p), right? This is perfectly legitimate, but it is one of the worst algorithms you can imagine for finding the minimum of a function, because typically it gets stuck in local minima, right? This is very well known. Yet the machine learning community uses this basic algorithm. So you might think that the most advanced community in numerics uses the worst possible optimization algorithm. Why is it so?
And the idea is that they actually use a modified version of this algorithm. If we write the full gradient in this case, let's call it g_GD, the actual gradient of the gradient descent approach, which depends on the full set of parameters, we can write it as 1 over Ns, the number of samples, times the sum of the gradients of the individual loss terms, the norm of f of x^(i) minus y^(i). So the full gradient is basically the sum of all the small gradients on each of the points in my data set, right? That is the exact gradient. Ah, sorry, that's my fault: this symbol is an L, actually, because it is the loss function. As I was saying, gradient descent is like the worst possible algorithm for optimizing things, so now let me explain why people nonetheless use it. The idea is not to use this full expression for the gradient, but the SGD approximation, stochastic gradient descent. This approximation is simply that instead of taking the full sum, you take only what people call a batch: you take only a small subset of the points in the data set to compute the gradient. For example, if we call Nb the size of the batch, where Nb is typically much smaller than the total number of samples, okay, then the SGD gradient is basically the same expression, but summed only over the points in the batch; it is just the gradient with respect to p of this restricted sum. I am not going to write it again. So instead of computing the full gradient over all the samples, we compute it on a smaller set. Pictorially, here I have my full set of points, and I have a lot of them, but I just pick a few of them at random.
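The batching just described can be sketched as follows. This is a minimal toy, not a production optimizer: the model, data, learning rate, and batch size are all hypothetical choices made for illustration:

```python
import random

def sgd_step(params, xs, ys, grad_fn, eta, n_b):
    """One stochastic-gradient-descent step on a random batch (drawn with replacement)."""
    batch = [random.randrange(len(xs)) for _ in range(n_b)]
    grad = [0.0] * len(params)
    for i in batch:
        g = grad_fn(params, xs[i], ys[i])      # per-sample gradient
        for k in range(len(params)):
            grad[k] += g[k] / n_b              # batch average approximates the full gradient
    return [p - eta * g for p, g in zip(params, grad)]

# Toy problem: fit y = a*x + b to noiseless data generated with a=2, b=1.
def grad_single(params, x, y):
    a, b = params
    err = (a * x + b) - y                      # residual on one sample
    return [2 * err * x, 2 * err]              # gradient of (a*x + b - y)^2

random.seed(1)
xs = [i / 10.0 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

params = [0.0, 0.0]
for step in range(2000):
    params = sgd_step(params, xs, ys, grad_fn=grad_single, eta=0.1, n_b=8)
print(params)  # approaches [2.0, 1.0]
```

Each step only touches Nb = 8 of the 20 samples, which is the whole point: the per-step cost is independent of the data-set size.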
And I pick them with replacement, this is important. On those, I compute my gradient, OK? This is an approximation of the true gradient of the function, and it has a very nice interpretation in terms of statistics, if you want. That's why this thing is called stochastic, because you can write both the original and the approximate gradient as expectation values. Think about it a little: let's assume that those x's in my data set are distributed according to some probability distribution pi of x, OK? And we can assume that the data set, the ensemble of points I have here, is drawn randomly from this probability distribution pi, OK? Then you immediately see that the exact gradient g_GD, if Ns is large enough, can basically be written approximately as a statistical expectation value over this probability distribution pi of those small per-sample gradients; call each of them F of x, the gradient evaluated on one configuration. So you can see that this is basically just an average: the average of this object, which depends on x, over a lot of samples x^(1), x^(2), ..., x^(Ns), OK? Now, instead of taking this expression for the gradient, which would be the exact one, we are taking a smaller one: we are computing this average on a much smaller subset. So the SGD gradient is the same expectation value, but on a much smaller data set, which means that by the law of large numbers the SGD gradient will approach the exact gradient of my original function. Oh, sorry, I don't know how to write this; I need machine learning to write for me.
So this SGD expression for the gradient will be equal to the exact gradient, plus a correction, which is stochastic in the sense that we are using a smaller subset. You will have a statistical error which, by the central limit theorem, is distributed like a normal variable with some variance sigma. So you will have a mean value, which is the exact gradient computed this way, plus fluctuations: a noisy variable. That is why this process is called stochastic gradient descent. However, we can interpret this process, in particular this iteration we are doing here, the way we are changing the parameters, in a very physical way. In the end we are physicists, so we want to understand what is going on from a better perspective. In particular, it turns out that the iteration we are doing, where we take a batch smaller than the full data set, is nothing but solving the Langevin equation, right? Do you know the Langevin equation? It is one of the most famous stochastic equations in physics. In this case it corresponds to solving the following iteration, a discretized version of the stochastic equation: the parameters at step s plus 1 are equal to the parameters at step s, minus delta tau, which is now a time step, times the gradient with respect to those parameters of some energy function E of p, plus a normally distributed noise term whose standard deviation is the square root of 2 delta tau times the temperature. So what am I doing here? I am imagining that I have an ensemble of classical particles whose coordinates are described by those parameters. I imagine that my parameters are equivalent, in a sense, to classical particles that can move under the action of an energy potential E of p, okay?
So p is not a momentum, it is just an arbitrary name for the high-dimensional coordinates, and T is the temperature of the bath in which these particles are moving, this classical bath at finite temperature T, okay? This is, if you want, the finite-temperature equivalent of Newton's equation: you go to the next time step and update your positions this way. And in the end you can show that the probability distribution according to which those parameters p are distributed is the equilibrium distribution of this Langevin dynamics: if you average over a long time, you can show that you are basically sampling from the Boltzmann distribution. The probability of observing a certain value of those coordinates, those parameters of my network, is, in the long-time limit, proportional to the exponential of minus this energy function over k_B T, okay? So with this connection we can immediately connect the two equations. Well, not immediately, but by making some extra assumptions, which are mostly harmless. Exactly, exactly. So the temperature will go basically like 1 over the number of samples in the batch, but this is within a particular approximation that I have to make. The approximation I am making now is that all those gradient components are uncorrelated; otherwise, you have to take care of the covariance, as we know. If they are just independent variables, then in this case I can identify the two things, and in particular I can say that the effective temperature in my original problem goes like eta times the variance of the gradient of my loss function, divided by two; but okay, that is just a constant. I am also assuming that I have one single temperature, so I am avoiding a few complications.
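The discretized Langevin iteration just written down can be sketched directly. The energy landscape here is a hypothetical one-dimensional quadratic, chosen only to make the update rule p(s+1) = p(s) - dtau * dE/dp + sqrt(2 * dtau * T) * xi visible; the annealing schedule is likewise illustrative:

```python
import math
import random

def langevin_step(p, grad_E, d_tau, temperature, rng):
    """p(s+1) = p(s) - d_tau * dE/dp + sqrt(2 * d_tau * T) * xi, with xi ~ N(0, 1)."""
    noise = math.sqrt(2.0 * d_tau * temperature) * rng.gauss(0.0, 1.0)
    return p - d_tau * grad_E(p) + noise

# Toy energy landscape E(p) = (p - 3)^2, with its minimum at p = 3.
def grad_E(p):
    return 2.0 * (p - 3.0)

rng = random.Random(42)
p = 0.0
for step in range(5000):
    temperature = 1.0 / (1 + step)   # anneal the bath temperature toward zero
    p = langevin_step(p, grad_E, d_tau=0.01, temperature=temperature, rng=rng)
print(p)  # ends near the minimum at p = 3
```

At fixed temperature the particle would fluctuate around the minimum with Boltzmann statistics; annealing the temperature to zero freezes it into the ground state, which is exactly the role the batch size plays in SGD.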
And then, in the end, since the variance of the batch average is itself proportional to 1 over the number of samples in the batch, this temperature will go like eta over Nb, times some intrinsic variance of the single-sample gradient. The important point, in the end, is that the effective temperature of these particles moving under the action of this loss function is inversely proportional to the number of samples in the batch: the more samples I put in my batch, the lower the temperature I am simulating. This also tells you that if I want to find the ground state, the minimum-energy sector of this loss function, I have to decrease the temperature with time, because I want to anneal, if you want, my system into the ground state. And that is what people do in the machine learning community; you will see that this is implemented in codes like TensorFlow, et cetera. There are two ways of reducing the temperature as a function of the iteration count. The first one is to increase the number of samples you put in the batch; this is typically done in our community, but not in the machine learning community. What they typically do in the machine learning community is define a step-dependent parameter eta. I remind you that eta is the coefficient you put in front of the gradient; it tells you how fast you are changing your parameters, and in the community it is called the learning rate, okay? So they define a step-dependent learning rate, which typically goes like a constant divided by s, or s plus 1 if you want.
So basically what you do is reduce this learning rate, thereby reducing the effective temperature and trying to converge to the ground state. And you have to do that, otherwise you get stuck in a high-temperature state. Now, to come back to the original question: why would you want to use this and not standard gradient descent? Well, what we are doing here is simulated annealing. For those of you who know what simulated annealing is, it is a strategy to find the minimum of a high-dimensional function by mapping this function onto the energy of an equivalent classical system, starting at a high temperature, which defines a characteristic initial energy, and then lowering this temperature, such that if you lower the temperature enough, and slowly enough, you will find the minimum of this function. And the important thing is that this approach can find the global minimum of the high-dimensional function, not just one local minimum, as the standard gradient descent approach would. So we have a double benefit from using this stochastic approach: the first is that it is much faster, because instead of summing over all the samples I sum over a much smaller subset; and the second is that typically I can find not just a minimum, but the global minimum. This is a very important advantage, and that is also why people in this community use these kinds of techniques. Yes, so here implicitly I am telling you what the annealing speed is: I am annealing this learning rate at a rate which here is 1 over the square root of s. There is a theorem which guarantees that if you anneal this thing at least like 1 over s, or something like that, I don't remember the exact exponent, you will converge to the minimum.
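The annealed learning-rate idea can be shown in a few lines. The function being minimized, the initial rate eta0, and the schedule eta_s = eta0 / (s + 1) are illustrative choices following the constant-over-s scheme mentioned above:

```python
def annealed_gradient_descent(grad, p0, eta0, n_steps):
    """Gradient descent with a step-dependent learning rate eta_s = eta0 / (s + 1)."""
    p = p0
    for s in range(n_steps):
        eta = eta0 / (s + 1)   # the learning rate decays, lowering the effective temperature
        p = p - eta * grad(p)
    return p

# Minimize E(p) = (p - 1)^2, whose gradient is 2(p - 1), starting far from the minimum.
p_final = annealed_gradient_descent(lambda p: 2.0 * (p - 1.0),
                                    p0=5.0, eta0=0.4, n_steps=200)
print(p_final)  # close to the minimum at p = 1
```

The decay must not be too fast: if eta shrank exponentially, the total distance the parameters can travel would be finite and the iteration could freeze before reaching the minimum, which is the content of the convergence theorems alluded to here.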
If you anneal this thing at least with some power; I do not remember the exact power. Of course, in practice there are a lot of things that can go terribly wrong, and those theorems are just a basis to understand the problem a little bit. Not at all, no. Typically you do not know that. You typically converge to a very good minimum, but then you do not know whether there is anything better. But it is certainly much better than using the standard gradient descent. So, as you have already understood, this is the hard part in the machine learning community. The fact that we can represent any arbitrary high-dimensional function is guaranteed by theorems, so nothing can go terribly wrong there. The thing which is hard is finding those parameters: finding those one-dimensional functions, finding those weights that enter the machine. This is the computationally hard part, the NP-hard part of the problem, if you want. But again, it is something that people have learned to do in a nice way, and we can use all the advances that people have made in the field. So let us see a little bit how these applications go. How much time do I have left? Five minutes, interesting. Okay, so let us see. So people have started using these techniques in science as well, and about two months ago Science published an issue saying that AI is transforming science, in the sense that if you have an artificial intelligence which is able to solve complicated problems like translation, but applicable to other domains in science, then it is clear that we should also rethink the role of the scientist, because at some point we might have an artificial intelligence which would be able to solve complicated problems for us.
We are not there yet, so there is still a job for physicists and mathematicians, fortunately, but we are going there. Now, the example I wanted to give of this supervised learning is the standard example in the community, an actual application, not to science, which is the so-called MNIST database. In this database we have as a data set an ensemble of images, so each image is one of those X's that I was describing at the beginning, and the pixels in those images are the components of high-dimensional variables. The data set is a bunch of digits that high school students were asked to write down some years ago, so we have thousands of handwritten images, and for each of those we also know an associated label: we know that this first one is a five, the second is a zero, and so on and so forth. So, using again this representation of the digit as a high-dimensional variable X, in this case living in a space of 784 dimensions, we can feed those things to a deep network like the one I was describing at the beginning and train the parameters. In this case the network will have an input layer, which is basically the 784 pixels of my image; these undergo a series of transformations according to the weights that have to be determined, and at the end the output of the network is the number I want my machine to identify. So the idea is that I can then use this machine to convert handwritten numbers, that the machine has never seen before, into actual numbers on a computer, which is particularly useful for real applications. I am not going to discuss them here, but you can find very nice tutorials on this, for example for TensorFlow.
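The structure of such a network can be sketched in a few lines of plain numpy (the lecture points at TensorFlow tutorials for the real thing; the hidden-layer size here is an illustrative choice of mine, and the weights are untrained random values, just to show the forward pass from 784 pixels to 10 digit probabilities).

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: 784 input pixels -> one hidden layer -> 10 output digits.
# The hidden size (128) is illustrative, not from the lecture.
n_in, n_hidden, n_out = 784, 128, 10

# The weights "to be determined" by training; here random initial values.
W1 = rng.normal(0, 0.01, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.01, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

def forward(x):
    """One pass through the network: pixel vector in, digit probabilities out."""
    h = np.tanh(x @ W1 + b1)           # hidden layer: weighted sum + nonlinearity
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max())  # softmax over the 10 output neurons
    return p / p.sum()

x = rng.random(784)  # a fake "image" as 784 pixel intensities
p = forward(x)
print(p.argmax())    # the digit the (untrained) machine would guess
```

Training then means adjusting `W1, b1, W2, b2` by the stochastic gradient descent discussed above, so that the output probabilities match the known labels.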
And you can find that this machine can achieve, actually, 99.5% accuracy in identifying never-before-seen numbers, which is very impressive. Yes, but it is written by different people with different handwriting. The important point is that I can write this number, and my number is clearly not part of the data set that contributed to the learning, and the machine is able to understand it; that is the point. Now let me just flash an application to physics. The first application you can think of for physics, which has been done recently, is to classify phases of matter. In this case, your data set might be, let us say, a bunch of pictures of phases of matter. For example, this X1 could be a solid; you know in advance that this is a solid, this is a liquid, solid, solid, liquid, and you have a lot of samples, say, that you have measured in the lab or whatever. Then you can train your machine to distinguish a liquid from a solid, and you can give it some picture that has never been shown before, for example the picture of this glass, and ask the machine to identify the phase. Now, how does this work in the case of actual models? This is what has been demonstrated in this paper by Carrasquilla and Melko. Carrasquilla is a former student of ICTP, so you might want to know that. The idea is that they showed this method, this supervised training, at work on the Ising model. Basically, they take a simple two-dimensional Ising model, they generate a lot of snapshots of this system sampled from its partition function, they transform those snapshots into images, and they feed them to an artificial neural network very similar to the one used to classify the digits. And to each of those samples they associate a label.
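The snapshot-generation step can be sketched with a standard Metropolis Monte Carlo sampler for the 2D Ising model (my own minimal version, not the authors' code; lattice size and sweep counts are illustrative): configurations sampled below the critical temperature get the "ordered" label, those above it the "disordered" label.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 16  # linear size of the square lattice (illustrative choice)

def sweep(spins, T):
    """One Metropolis sweep: attempt one spin flip per lattice site."""
    for _ in range(spins.size):
        i, j = rng.integers(L), rng.integers(L)
        # Sum of the four nearest neighbours, with periodic boundaries.
        nn = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
              + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
        dE = 2 * spins[i, j] * nn  # energy cost of flipping spin (i, j), J = 1
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            spins[i, j] *= -1

def snapshot(T, n_sweeps=100):
    """One equilibrated spin configuration at temperature T."""
    spins = np.ones((L, L), dtype=int)  # start fully ordered
    for _ in range(n_sweeps):
        sweep(spins, T)
    return spins

# Labelled training data, as in the supervised scheme described above:
# below Tc (~2.27) the snapshot is ordered, above it disordered.
cold = snapshot(T=1.5)  # label: ordered
hot = snapshot(T=4.0)   # label: disordered
print(abs(cold.mean()), abs(hot.mean()))  # magnetization: large vs near zero
```

Each `snapshot` is then flattened to a vector of 256 spins and fed to the network exactly like a digit image, with the phase label playing the role of the digit label.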
For example, these two samples are in the ordered, ferromagnetic phase, and these other two samples are above the critical temperature, so they are in the disordered phase. What they managed to show is that the machine, trained basically in the same way as in the digits problem, is able to recognize the phase transition in the problem with very high accuracy, more than 99%: it is able to identify whether a new image is in the ordered or in the disordered phase. Most importantly, they can apply the same technique to other models, not only on the square lattice but, for example, on the triangular lattice, and the machine knows even on this other lattice where the phase transition is. It knows, for example, that the critical temperature is different: here it is about 2.27 and here it is about 3.64. So in this case it is clear that the machine has learned the order parameter for this phase transition. This was just one of the first applications; we will see other things in the following, especially in the context of quantum physics. So thank you.
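For reference, the two critical temperatures the machine effectively rediscovers are exactly known for the 2D Ising model (in units where J/k_B = 1): Onsager's result for the square lattice and the corresponding exact result for the ferromagnetic triangular lattice.

```python
import math

# Exact 2D Ising critical temperatures, in units J/k_B = 1.
tc_square = 2 / math.log(1 + math.sqrt(2))  # square lattice (Onsager), ~2.269
tc_triangular = 4 / math.log(3)             # triangular lattice, ~3.641

print(tc_square, tc_triangular)
```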